HERV-H RNA is abundant in human embryonic stem cells and a precise marker for pluripotency

Background Certain post-translational modifications to histones, including H3K4me3, as well as binding sites for the transcription factor STAT1, predict the site of integration of exogenous gamma-retroviruses with great accuracy and cell-type specificity. Statistical methods that were used to identify chromatin features that predict exogenous gamma-retrovirus integration site selection were exploited here to determine whether cell type-specific chromatin markers are enriched in the vicinity of endogenous retroviruses (ERVs). Results Among retro-elements in the human genome, the gamma-retrovirus HERV-H was highly associated with H3K4me3, though this association was only observed in embryonic stem (ES) cells (p < 10-300) and, to a lesser extent, in induced pluripotent stem (iPS) cells. No significant association was observed in nearly 40 differentiated cell types, nor was any association observed with other retro-elements. Similar strong association was observed between HERV-H and the binding sites within ES cells for the pluripotency transcription factors NANOG, OCT4, and SOX2. NANOG binding sites were located within the HERV-H 5′LTR itself. OCT4 and SOX2 binding sites were within 1 kB and 2 kB of the 5′LTR, respectively. In keeping with these observations, HERV-H RNA constituted 2% of all poly A RNA in ES cells. As ES cells progressed down a differentiation pathway, the levels of HERV-H RNA decreased progressively. RNA-Seq datasets showed HERV-H transcripts to be over 5 kB in length and to have the structure 5′LTR-gag-pro-3′LTR, with no evidence of splicing and no intact open reading frames. Conclusion The developmental regulation of HERV-H expression, the association of HERV-H with binding sites for pluripotency transcription factors, and the extremely high levels of HERV-H RNA in human ES cells suggest that HERV-H contributes to pluripotency in human cells. Proximity of HERV-H to binding sites for pluripotency transcription factors within ES cells might be due to retention of the same chromatin features that determined the site of integration of the ancestral, exogenous, gamma-retrovirus that gave rise to HERV-H in the distant past. Retention of these markers, or, alternatively, recruitment of them to the site of the established provirus, may have acted post-integration to fix the provirus within the germ-line of the host species. Either way, HERV-H RNA provides a specific marker for pluripotency in human cells.


Background
Vertebrate genomes contain retroviral sequences that are believed to be remnants of exogenous retroviral infection from the distant past [1]. The genesis of these endogenous retroviruses (ERVs) necessitates establishment of provirus by the ancestral, exogenous retrovirus within host germ cells, such that these elements are maintained as heritable genetic elements in the host species. Human endogenous retroviruses (HERVs) are not uncommon, and account for at least 8% of the total human DNA. By comparison with DNA sequences from exogenous retrovirus families, three main classes of HERVs, gamma, beta, and delta have been identified [2]. Phylogenetic analyses identified HERV-K, a betaretrovirus, and HERV-H, a gammaretrovirus, as the most recent entries into the genomes of primates, 10 and 25 million years ago, respectively [3,4].
Though some HERVs are transcriptionally active, most retroviral sequences in the human genome are corrupted by mutations or successive insertion of transposable elements; most HERVs lack intact open reading frames for viral protein production and no autonomously replicating HERV has been identified. As a result, HERVs are generally considered non-functional, junk DNA. In some cases, though, HERVs make significant -even essential -contributions to the normal physiology of the host species [5][6][7][8].
Exogenous retroviruses integrate into locations throughout the host chromosomal DNA in a quasi-random fashion (reviewed extensively in [18]). Previously, we developed statistical tools to identify association between the sites of provirus establishment and chromosome marksas determined by chromatin immuno-precipitation with massively parallel DNA sequencing (ChIPSeq) [18,19]. From this analysis we identified chromatin modifications (H3K4me3, H3K4me1, and H3K9ac) and binding sites for transcription factors (STAT1) that predict the site of integration for gamma-retroviruses with great accuracy in a cell-type specific manner [18,19]. Precise markers such as these were not identified for other classes of retrovirus such as the lentivirus HIV-1 [18]. Here, the same statistical methods were exploited to determine whether any ERVs retain association with cell type-specific chromatin features that might have determined the site of integration in the distant past, or that served to fix the provirus in the genetic patrimony of the host species.

Results
Search for chromatin features near the site of ERVs Previously observed, high-level association between gamma-retrovirus integration sites and particular epigenetic markers [18] prompted a quest to find association between any known endogenous retroviral element in the human and mouse genomes, and the cell-type specific localization of particular chromosomal features. For this purpose, ChIPSeq profiles were evaluated from more than 40 different human and mouse cell types, including ES cells, iPS cells, monocytes, HeLa cells, CD4+ T cells, and CD34 + hematopoietic cells (Table 1). H3K4me3 was used in the initial analysis because of the availability of ChIPSeq datasets from a large number of cell types for this marker. The ERV dataset was compiled using all endogenous, LTR-containing elements annotated via RepeatMasker on the human reference genome hg18 (UCSC) or the mouse reference genome mm9 (UCSC).

Figure 1 shows this analysis schematically. The block
Associator is a computational module fed by ChIPSeq profiles and LTR loci. Among all LTRs, only HERV-H showed a significant association with H3K4me3, though this association was only with some human cell types; the location of the endogenous gamma-retrovirus was associated with H3K4me3 profiles in ES cells (F-score >0.8; p < 10 -300 , by Fisher exact test) and, to a lesser extent, in iPS cells (F-score >0.7; p < 10 -100 ). F-scores > 0.5 are considered significant with 1.0 maximal [18], so these values are highly significant. In contrast, no retroviral element in the mouse was found to be associated with H3K4me3.
The data were assessed by means of Hierarchical Clustering. Each association profile was calculated as a function of the distance between HERV-H and the nearest H3K4me3 marker, discretized with steps of 0.5 kB. The specific cells are listed on the x axis of Figure 2, and the window size in kilobases is shown on the y-axis. The Euclidean distance between these profiles was used to discriminate among clusters. The algorithm identified three main clusters of high, medium and low association (represented in red, green and blue in Figure 3). The cluster with the highest association (red) was populated by ES cells (H1, H9 and I3) and iPS cells (iPS-15b and iPS-11a), with a mean F score of 0.74. The cluster with medium association (green) was populated by bone marrow mesenchymal stem cells (DMSC), breast cancer cells (vHMEC), fetal lung cells and fetal brain cells; it had F scores that were just barely significant. Some iPS cells (A6, PDB1lox, 18c, C1) fall within this cluster, thus confirming that many iPS cells have epigenetic profiles distinct from those of ES cells [29]. The blue cluster consisted of differentiated cells, including CD4 + T cells, fibroblasts, pancreatic islet cells, and HeLa cells; the mean F score of 0.36 was insignificant. The association of HERV-H with the post-translational histone modification H3K4me3 correlated strongly with the degree of cell differentiation, a relationship that was clearly visible using chromosome projection mandalas ( Figure 4).

HERV-H expression in ES cells
Given the remarkable, pluripotent stem cell-specific association of HERV-H with H3K4me3, a marker for transcriptionally active promoters [30,31], HERV-H expression would be expected to be higher in ES and iPS cells than in differentiated cell types. Consistent with this possibility, HERV-H was not associated with H3K27me3 (F score 0.2), a marker for transcriptional repression in ES cells [32]. EN-CODE Project RNA-seq data sets [33] of paired 75 nucleotide reads from human ES cells and 6 differentiated cell types were assessed for HERV-H RNA peaks. In H1 human ES cells, HERV-H RNA accounted for 2% of the total RNA. This extraordinary level of HERV-H expression was confirmed with RNA-seq data [34] from H9, another ES cell line. For comparison, expression of the younger and better-conserved beta-retrovirus HERV-K was 1000-fold lower than HERV-H in H1 ES cells. HERV-H RNA was 100-lower in HeLa cells and more than 100-fold lower in K562 myelogenous leukemia cells, GSM12878 lymphoblastoid cells, HepG2 hepatocellular liver carcinoma cells, human umbilical vein endothelial cells (HUVEC), and NHEK epidermal keratinocytes. BRD2, a gene that is expressed at nearly the same level in all these cell types was 25-fold lower in expression than HERV-H in ES cells ( Figure 5).
Almost all the HERV-H RNA expressed in ES cells had the structure 5 0 LTR-gag-pro-3 0 LTR with deletion of the pol and env regions, and no intact open reading frames. HERV-H fragments containing pol and env sequences were barely detected ( Figure 6).
It has been reported that human ES cell DNA is hypomethylated with respect to differentiated cell lines, and that this global effect releases all endogenous elements, such as SINEs, LINEs, and HERVs, from transcriptional silencing [35]. To determine if HERV-H expression in ES cells is simply a result of this global trend, the RNAseq data was used to compare the expression level of all repetitive elements in ES cells. Compared to LINEs and SINEs, HERV-H was by far the most expressed repetitive element in human ES cells, accounting for nearly all transcription of the HERV family ( Figure 7).
It is also important to consider that expression of endogenous retro-elements might be influenced by local effects from adjacent transcriptional units. The difference between the expression level of the repetitive elements and the expression level of the surrounding sequences was assessed (see Methods for details). By this measure, HERV-H exhibited transcriptional specificity comparable to that of conventional genes, in terms of specific/unspecific transcription ratio (0.85 vs. 0.86), while LINEs and SINEs had a transcription ratio of 0.06 and 0.03, respectively ( Figure 7).

HERV-H expression over the course of human ES cell differentiation
The disparity between the high transcriptional level of HERV-H in human ES cells, and the near absence of HERV-H transcription in differentiated cells prompted an assessment of HERV-H expression as ES cells differentiate. To accomplish this, raw data was analyzed from a 4-point time course experiment (http://www.ncbi.nlm.nih.gov/geo/ query/acc.cgi?acc=GSE20301) in which RNA was collected from human ES cells in the undifferentiated state (N0), early initiation (N1), neural progenitor (N2), and pre-glial cell (N3) stages [26]. High HERV-H transcriptional levels were confirmed in the undifferentiated N0 stage, and HERV-H levels decreased progressively with a magnitude that correlated with the differentiation time-point ( Figure 8).

HERV-H and pluripotency transcription factors
When ectopically expressed in particular combinations, the transcription factors NANOG, OCT4, and SOX2 are capable of reprogramming mature somatic cells into pluripotent stem cells [36][37][38][39]. Conversely, the expression of these factors decreases as cells differentiate [40]. The RNA-seq data sets utilized above that measure the dynamics of RNA in ES cells as they differentiate into neural progenitors was analyzed for expression of HERV-H and for these transcription factors. HERV-H RNA levels correlated well with those of NANOG and OCT4 ( Figure 9). SOX2 was more stably expressed during differentiation than was NANOG or OCT4 and its expression did not correlate with that of HERV-H. To explore the possibility that these pluripotency transcription factors are somehow involved in HERV-H transcription, an association analysis was conducted between HERV-H locations and the ChIPSeq binding profiles of NANOG, OCT4, SOX2, and KLF4 extracted from ES cells [24,25] (Figure 10). Over a 2 kB window, NANOG showed the most significant association (F score 0.75, p < 10 -180 ). OCT4 weakly associated with HERV-H (F score 0.55, p < 0.0001), while SOX2 and KLF4 were not associated (F score 0.46 and 0.24, respectively).
Statistical significance increased considerably when the expression level of individual HERV-H proviruses was monitored ( Figure 11A and B). As visualized on the chromosome projection mandala in Figure 11C, NANOG binds to the LTRs of 96% of the 50 most highly expressed HERV-H proviruses. These 50 proviruses account for 80% of all HERV-H RNA. The LTR region corresponding to the first 0.5 kB of HERV-H proviral sequence is highlighted in green, with dot sizes proportional to expression levels. With a window of 0.5 kB, the F score for NANOG was nearly perfect (0.97, p < 10 -300 ). Limited to this subset of HERV-H proviruses the F scores for OCT4 (0.90 wi2kB, p < 10 -300 ) and SOX2 (0.85 wi4kB; p < 10 -300 ) were also tightly linked to HERV-H. KLF4, in contrast, was not associated with HERV-H (F score 0.34). The average distance of the NANOG, OCT4 and SOX2 ChIPSeq peaks from the HERV-H transcription start sites was 250 basepairs, 1 kB and 2 kB, respectively, as if these factors were uniformly distributed along the HERV-H promoter in Figure 1 Schematic showing the strategy for detecting association between endogenous retro-elements and the H3K4me3 marker. H3K4me3 CHiPSeq data for more than 40 different cell types were checked for association within 2 kB of all annotated endogenous retroviral elements (hg18) using methods described in reference [18]. Only HERV-H proviruses showed association with H3K4me3, and this was only significant in human ES cell lines. No association between H3K4me3 and endogenous retro-elements was detected in the mouse.

Figure 2
H3K4me3 association with HERV-H proviruses, as measured with the F score, in 40 different cell types, as a function of the distance from the marker. The F score is reported on the z-axis. Cell types are listed on the x-axis, while the window size is reported on the yaxis. Yellow to red bands indicate a significant association (F score >0.5). Significance decreases as the color shifts from red to blue. 12 out of 15 of the cell types with significant F scores are either human ES cells or iPS cells.
human ES cells ( Figure 11B). Indeed, for 80% of the highly expressed HERV-H proviruses NANOG, OCT4 and SOX2 map in this order.

Association of LINE elements with NANOG in human ES cells
As a matter of comparison, a chromosome projection mandala was created in which NANOG binding sites in ES cells were compared with the location of the LINEs ( Figure 12). Only one LINE, located on chromosome 13 between nucleotides 55,048,158 and 55,050,476, was expressed to an appreciable level. Interestingly, a deeper inspection of this region revealed a particular genomic structure where LINE transcription was driven by an adjacent HERV-H, bounded by NANOG binding sites at both LTRs. This element does not have an intact ORF, and its functional role is unclear. It is also noteworthy that the adjacent HERV-L is not expressed, even if it is immediately adjacent to the transcribed region, confirming that HERV-H expression is not a non-specific consequence of hypomethylation in ES cells.

Discussion
This study discovered a strong association between the genomic location of HERV-H proviruses and H3K4me3modified histones, that is exclusive to human ES and some iPS cells. Consistent with H3K4me3 serving as a marker for active transcription, HERV-H expression was high in these pluripotent stem cells. Moreover, the pluripotency transcription factors NANOG, OCT4 and SOX2 bound to the LTRs of transcriptionally active HERV-H proviruses, or within 2 kB of them. NANOG, OCT4 and SOX2 frequently co-occupy the promoters of target genes, many of which are transcription factors that regulate development such as homeodomain proteins [41]. Hierarchical cluster analysis of the HERV-H/H3K4me3 F score, as a function of the window size (horizontal axis). F score is represented using the color scale as in Figure 2. The highly associated cluster (red tree) was made of human ES and iPS cells. The cluster with medium association (green) consisted of mesenchymal stem cells (DMSC), fetal lung and brain cells, and some iPS cells, with F scores that were barely significant. The blue cluster consisted of differentiated cells and the mean F score of 0.36 was insignificant.
These observations strongly support the hypothesis that HERV-H transcripts play a role in human pluripotency and that this role is finely regulated by three of the most important transcription factors in ES cells. In addition to the binding of NANOG, OCT4, and SOX2 to the HERV-H promoter, HERV-H RNA decreased as ES cells differentiated, in a manner that was proportional to the expression of NANOG and OCT4. Conversely, HERV-H RNA was undetectable in primary fibroblasts but increased enormously after forced re-programming to generate pluripotent stem cells (unpublished data provided by Audrey Letourneau and Stylianos Antonarakis). HERV-H, then, can be exploited as a reliable marker of ES cell pluripotency, as well as an indicator of the degree of "stemness" of iPS cells as they are generated from fibroblasts.
HERV-H transcripts are 5 to 6 kB in length and lack open reading frames. We can only speculate about the function of these lncRNAs. They might, for example, serve to soak-up miRNAs that promote differentiation, as has been shown with linc-MD1 in muscle differentiation [42] Figure 5 Cumulative expression of all HERV-H proviruses in human H1 ES cells (hESC), HeLa cells, or K562 cells, compared to expression of HERV-K and BRD2, a constitutive gene with the same expression level in all three cell types. In human ES cells, HERV-H is expressed 1000-fold higher than HERV-K and 25-fold higher than BRD2. HERV-H expression is barely detectable in HeLa, and no significant HERV-H expression was detected in K562 cells. RNASeq data for this analysis were from reference [33]. or the PTENP1 pseudogene in the regulation of PTEN and growth suppression [43]. They might bind to chromatin and act as a scaffold for the local recruitment of pluripotency transcription factors, similar to other lncRNAs like HOTAIR for histone modification complexes [44] and Xist in the context of X-chromosome inactivation [44][45][46]. Alternatively, HERV-H might counteract retrovirus spread by interfering with packaging of retroviral genomic RNA [47,48] or by soaking up miRNAs that are required for retrovirus transduction.
The study here failed to identify chromatin markers that associate with endogenous retro-elements in mice. This was somewhat surprising given the many endogenous retro-elements in this species, including endogenous gamma-retroviruses, some of which are intact and functional [8]. It was also surprising because exogenous gamma-retroviruses have the same integration site preferences in mouse cells as they have in human cells [18]. MLV integration sites are associated with the H3K4me3 profile in mouse embryonic fibroblasts (F score = 0.83; Figure 6 Mapping of RNA-seq reads from H1 human ES cells on a schematic of the HERV-H provirus. The quantity of each RNA read was normalized to the reads corresponding to the 5 0 LTR. Only RNA fragments corresponding to 5 0 LTR-gag-pro-3 0 LTR were expressed to a significant level in human ES cells. p < 10 -100 ). Similar results with murine hematopoietic stem cells (F score of 0.81; p < 10 -100 ) indicate that, as in human cells, the association strength is cell-type dependent.
One possible explanation for the failure to identify chromatin markers associated with endogenous murine retroviruses is species-specific differences in the recruitment of the transcriptional silencing machinery. In murine ES cells, for example, a sequence-specific DNA-binding protein, ZNF809 [49], recruits TRIM28 and other components of the cellular machinery that silence MLV [50]. ZNF809 has no orthologue in humans; perhaps ZNF809 arose as a result of selective pressure exerted by murine specific gamma-retroviruses during evolution.
Previous work demonstrated that when exogenous retroviruses integrate they home to sites of H3K4me3 [18]. Similarly, the association of endogenous gamma-retrovirus HERV-H with H3K4me3 suggests that when human and simian germ cells were bombarded with the HERV-H ancestor 15 to 30 millions years ago, these ancient retroviruses integrated in proximity to H3K4me3-marked chromatin. These proviruses might then have retained these cell-type specific marks as they became fixed in the primate genome. Alternatively, unmethylated HERV-H LTRs might have recruited chromatin remodeling factors and induced H3K4me3 modification of the viral promoter after integration had occurred.  Analysis of the DNA surrounding HERV-H proviruses failed to clarify which of these two scenarios is more likely. Additionally, search for epigenetic markers like H3K4me3 in syntenic regions in the mouse genome was attempted to determine if these chromatin marks are conserved across the species and predate the entry of HERV-H into the primate genome. DNA surrounding HERV-H proviruses in the human genome was aligned to the mouse genome (using the tool Lift-Over, http://genome.ucsc.edu/cgi-bin/hgLiftOver) after Figure 10 Modified chromosome projection mandalas depicting NANOG, OCT4, SOX2 and KLF4 associations with HERV-H proviruses. Mandalas were generated as described in reference [18] and Figure 4 above. The green ring indicates the HERV-H 5 0 LTR region (500 nucleotides from the transcriptional start site). Dot size is proportional to the expression level of each single provirus. NANOG is bound to the 5 0 LTR of almost all highly expressed retroviruses. OCT4 shows a significant association. SOX2 binds all expressed HERV-H in a region between 1 KB and 2 KB while KLF4 does not show any significant association pattern.  Figure 11 Ordered spacing of pluripotency transcription factors, binding to the HERV-H 5 0 LTR in human ES cells. (A) Association strength (F score) of NANOG, OCT4 and SOX2 binding sites with the 50 most highly expressed HERV-H proviruses (accounting for 80% of total HERV-H expression), as a function of distance from the HERV-H transcription start site (TSS). Maxima in F score indicate the distance of greatest association. (B) Average distance of NANOG (red), OCT4 (blue) and SOX2 (green) to HERV-H TSS is shown schematically. As expected from a uniform distribution model, the average distance is half of the distance between maximal association and TSS. (C) Chromosome Projection Mandala combining NANOG, OCT4 and SOX2 with respect to 50 HERV-H proviruses. The three embryonic transcription factors bind with the same order (NANOG-OCT4-SOX2) to the promoter region of the most expressed HERV-Hs. . The direction of transcription was determined by strand-specific sequencing [33]. NANOG bound to both LTRs (represented with black squares along the LTR row) in the adjacent HERV-H. The adjacent HERV-L was not transcribed. excision of all repetitive elements (performed with RepeatMasker, http://www.repeatmasker.org/). Nothing informative was found by measuring the association of mouse H3K4me3 with these syntenic regions and comparing these values with those obtained using control loci.
P300 and H3K27ac bind intra-species conserved regions and co-localize in embryonic-specific enhancers [34,51]. These markers are also associated with HERV-H in human ES cells, lying within 4 kB of 80% of HERV-H proviruses (F score 0.95; p < 10 -300 ). It seems unlikely that an exogenous retrovirus would be capable of recruiting these factors by exploiting random conserved regions around its integration site. This suggests that a pre-existing layer of epigenetic markers favored integration of HERV-H into particular host loci and that these features are still preserved millions of years later.

Conclusions
Among retroelements in the human genome, the endogenous gammaretrovirus HERV-H is extraordinary for its high level expression in embryonic stem cells, in which it makes up 2% of all polyadenylated RNA. The human genome has~1,000 copies of HERV-H, and the majority of the HERV-H RNA is encoded by a subset of 50 of these. HERV-H expression decreases as ES cells lose pluripotency, to the point where its expression is undetectable in fibroblasts. Consistent with this expression pattern, HERV-H is also expressed to high level in many iPS cells, though expression in some iPS cells is more modest; this heterogeneity may reflect reported differences in the epigenetic profile of many iPS cells, when compared with ES cells [29]. This suggests, then, that HERV-H RNA offers a relatively stringent marker for human pluripotency that would be worth monitoring during the generation of new iPS lines. The HERV-H RNAs in ES cells average about 5 kB in length and encode no protein. It, therefore, seems likely that HERV-H RNA contributes to pluripotency by acting as a chromatin-associated structural element or by acting as a microRNA decoy.

Statistical analysis of association between chromatin markers and retro-elements
Association with a given marker was defined as the presence of the endogenous proviral DNA within a fixed distance (usually 2 kilobases) from the nearest marker on the linear sequence of the chromosome. Unlike exogenous retroviruses, the endogenous virus is already integrated. Therefore, to restore the conditions before integration, the distance from the j-esim marker M j with peaks in loci {m j0 , . . ., m jN } and the i-esim provirus V i spanning along the loci {V i s , V i e } has been calculated as is the central locus of the provirus. As a control dataset, we randomly selected 100000 genomic locations. Association strength was measured with the statistical method based on the F score, as previously described in [18].
Formally the F β -score is defined as the β-weighted harmonic mean of Precision(P) and Sensitivity(R): Here β = 0.5 to give more weight to Precision than to Sensitivity. This balances type I and type II errors by adjusting for the high rate of False Positives inherent to the examination of large datasets for genome-wide binding sites according to statistical significance (F score based statistics and comparison with other measures have been extensively discussed in [18]).
Markers with F scores ranging between 0.5 and 1 were considered to be associated with endogenous integration sites.

RNASeq data analysis
HERV-H is present in more than 1000 "imperfect" copies in the human genome and its transcripts share a number of short conserved regions (each around 100 bp). Therefore, deep sequencing of those transcripts yields reads (25-75 bp long sequences) which perfectly align to several genomic loci. Indeed, multireads mapping is still a challenging process [52]. The strategy adopted here was to perform the alignment of uniquely mapped single-and paired-end reads and to reassign the multiple-mapped reads in function of the expression level of the surrounding (context) region [26]. HERV-H expression was evaluated in term of "Reads Per Kilobase per Million mapped reads (RPKM)" with the standard formula E r ¼ K N r L r N T [53] where N r is the number of reads mapping onto the r transcript, L r is the length (in kB) of the r-esim transcript and N T is the total number of reads K = 10 6 .
Alignments of RNASeq generated reads have been performed with a two-step procedure. First we used Bowtie [54] on raw data, admitting up to two mismatches for each alignment, and then we discriminated between unique mapping reads and multireads, 80% and 20% of the total number of reads respectively. As expected, many multireads matched with repeated elements. Ignoring them would have resulted in a potential underestimation of the expression of endogenous retroviruses and, in general, of all repetitive elements. Therefore, we adopted a probabilistic assignment based on the amount of reads that univocally map onto the surrounding regions (context), as described in more detail in the next paragraph. This evaluation is also useful to establish if a repeated element is expressly transcribed or if it is part of another structure (i.e. a gene).

Probabilistic alignment of multiple mapping reads
The uniquely mapping read was defined as a short sequence generated by high throughput sequencing that can be aligned to a single genomic region s. Accordingly, we defined the multiread as a sequence r that aligns to a set of M regions {S 1 ,. . .,S M }. For each region s i the context region is c i = C i /(C i \ s i ) being C i a genomic region of n nucleotides encompassing s i . The assumption that the amount of reads is proportional to the amount of actual mRNA implies that the set of multireads is distributed on the reference genome accordingly to the amount of uniquely mapping reads aligned to the context regions. Consider R D s ð Þ as the function giving the number of reads of the dataset D that map univocally to the genomic region s described by the tuple (chr, start, end). Therefore, the probability of the read r to be actually part of the mRNA generated from the region s i is estimated as: Eventually, the set of multireads mapping to the same M regions is then partitioned to {S 1 ,. . .,S M } accordingly to P D (s i ).

Specificity
The first axiom of high throughput sequencing asserts that the number of reads aligned to a specific genomic region is proportional to how much RNA has been generated by this region within the cell. At this point it is worth observing that repeated element (RE) sequences might be present in the RNA just because they are part of longer mRNAs.
Since we expect that elements having a specific biological function are independently transcribed, we attempted to distinguish between RE specifically expressed with their own promoter from those that are part of longer RNAs. The number of reads mapped to a region s can be naively modeled as a linear combination of specific reads T(s), unspecific reads U(s) and additional zero-mean noise σ 2 that account for all other experimental and nonsystematic fluctuations that can randomly influence the output of the sequencing process. Formally: where R s ð Þ, as before, is the function giving the number of reads assigned to the region s. Therefore, the mean number of reads in the region s is: E T s ð Þ ½ ¼R s ð Þ À U s ð Þ: In order to estimate U(s), it is possible to count the number of reads mapped to the context region c as previously shown. Therefore we set and we eventually adopt the following approximation to correct for the non-specificity of transcription: Abbreviations HERV: Human endogenous retrovirus; ES: Embryonic stem; iPS: Induced pluripotent stem; LTR: Long terminal repeat; lncRNA: Long non-coding RNA.