Identification of the distribution of human endogenous retroviruses K (HML-2) by PCR-based target enrichment sequencing

Background Human endogenous retroviruses (HERVs), suspected to be transposition-defective, may reshape the transcriptional network of the human genome by regulatory elements distributed in their long terminal repeats (LTRs). HERV-K (HML-2), the most preserved group with the least number of accumulated of mutations, has been associated with aberrant gene expression in tumorigenesis and autoimmune diseases. Because of the high sequence similarity between different HERV-Ks, current methods have limitations in providing genome-wide mapping specific for individual HERV-K (HML-2) members, a major barrier in delineating HERV-K (HML-2) function. Results In an attempt to obtain detailed distribution information of HERV-K (HML-2), we utilized a PCR-based target enrichment sequencing protocol for HERV-K (HML-2) (PTESHK) loci, which not only maps the presence of reference loci, but also identifies non-reference loci, enabling determination of the genome-wide distribution of HERV-K (HML-2) loci. Here we report on the genomic data obtained from three individuals. We identified a total of 978 loci using this method, including 30 new reference loci and 5 non-reference loci. Among the 3 individuals in our study, 14 polymorphic HERV-K (HML-2) loci were identified, and solo-LTR330 and N6p21.32 were identified as polymorphic for the first time. Conclusions Interestingly, PTESHK provides an approach for the identification of the genome-wide distribution of HERV-K (HML-2) and can be used for the identification of polymorphic loci. Since polymorphic HERV-K (HML-2) integrations are suspected to be related to various diseases, PTESHK can supplement other emerging techniques in accessing polymorphic HERV-K (HML-2) elements in cancer and autoimmune diseases.


Background
Human endogenous retroviruses (HERVs) are relics of ancient germ cell infection by exogenous retroviruses that became incorporated into germ line DNA [1]. HERVs are thought to have been co-opted into physiological roles in the host even though previously they have been regarded as "junk DNA". The more direct advantage for the host of endogenization may be the protection against infection by related exogenous viruses [2]. Furthermore, HERVs have been shown to be important determinants of pluripotency in human embryonic stem cells and in the reprogramming process of induced pluripotent stem cells [3]. HERV-K (HML-2), the best evolutionarily preserved and most biologically active group among the HERVs, maintain coding competence with complete, or near-complete ORFs for all viral proteins [4,5]. They have elevated transcripts expression, induce antibodies to viral proteins, such as gag and env, and generate viruslike particles in certain types of diseases [6,7]. Increasing numbers of studies are now focusing on the pathogenetic mechanisms of HERV-K (HML-2) [8,9]. One mechanism is the viral proteins produced by HERV-K (HML-2); for example, rec, np9, and env, may work as onco-proteins. Another important mechanism is the regulatory functions of LTRs, which may enable the LTRs to serve as genome-wide regulators, including alternative promoters, enhancers, polyadenylation signals, and binding sites for transcription factors [10].
HERV-K (HML-2) is the only group of HERV-K containing human-specific and polymorphic loci. Polymorphic loci that are unfixed in the modern human population suggest that they remained transcriptionally active up to very recent times in the evolutionary history of the Homo Sapiens species. When compared with other HERV loci, HERV-K (HML-2) have fewer accumulated mutations, indicating that they likely continue to co-evolve with the host, and they are probably still biologically active at present. Insertionally polymorphic HERV-K (HML-2) loci show different structural features, including full-length provirus integration, solo-LTR integration and unoccupied pre-integration, in different individuals [11,12]. Therefore, polymorphic HERV-K (HML-2) integrations have the capacity to alter the expression of viral proteins as well as LTR regulation of host-genes. Several polymorphic loci have been reported to be correlated with diseases. For example, the distribution of two polymorphic HERVs, HERV-K113 and HERV-K115, have been shown to be associated with autoimmune diseases [13,14]. A polymorphic HERV-K (HML-2) solo-LTR insertion (1p13.2) is possibly involved in lung adenocarcinoma [15]. Additionally, a polymorphic locus, in 5q14.1, integrated within RAS-GRF2, can disrupt host gene transcription and is associated with drug addiction [16]. Therefore, studies on the genome-wide distribution of HERV-K (HML-2) to identify polymorphic insertion loci should enable the possible identification of disease-related genes and their regulation by HERV-K (HML-2) elements.
Recently, several studies have focused on positioning of the location of HERV-K (HML-2) in the human genome (Table 1). In 2011, Subramanian et al. mined the hg19 genome and identified that more than one thousand proviral and solo-LTR HERV-K (HML-2) integrations exist in the hg19 genome [17]. Researchers focused attention on non-reference loci that do not exist in hg19. With the emergence of next generation sequencing (NGS), several data-mining tools have been established capable of evaluating the status of HERV-K (HML-2) from whole genome sequencing (WGS) data [18][19][20][21]. Dozens of non-reference and unfixed loci now complement previous HERV-K (HML-2) data. To determine the accuracy of genotyping HERV-K (HML-2) integrations, higher depth WGS data is needed; however, the cost in time and money is high, although information of the distribution of HERV-K (HML-2) makes it is possible to characterize the polymorphic nature of HERV-K (HML-2)

Techniques
Need sequencing? Target enrichment? High throughput?

Detectable loci Reference
Mining in the reference human genome (hg19)

No
No Yes Proviruses and solo-LTR Subramanian et al. [17] Locus-specific PCR Yes, Sanger sequencing Yes, PCR No Proviruses or solo-LTR Otowa et al. [22] Moyes et al. [23] Wildschutte et al. [24] Genome-wide amplification of proviral sequences (GAPS) integrations. Locus-specific PCR can target tens of loci instead of only one or two loci; however it is usually limited to throughput using Sanger sequencing [22][23][24]. A modified PCR technique, genome-wide amplification of proviral sequences (GAPS), established to identify polymorphic proviral loci, is also restricted by the restriction enzyme (VspI) cutting sites and sequencing method [25]. Therefore, development of an appropriate and economical method, requiring a low volume of sequencing data but capable of identifying a large number of loci, is important. Target enrichment sequencing may be an ideal approach. A specific PCR enrichment protocol targeting HERV-K (HML-2) proviruses has been successful in identifying gorilla-specific loci integrated polymorphisms [26]. A probe-based enrichment sequencing method targeting proviral and solo-LTR integrations has been reported [16,27]. However, enrichment sequencing techniques using the Miseq platform produces fewer reads compared with Hiseq. In the present study, we established a PCR-based target enrichment sequencing of HERV-K (HML-2) (PTESHK) that both detects reference loci and identifies novel loci, which will be helpful for completing the full genomic distribution of HERV-K (HML-2) in the current reference genome. Furthermore, PTESHK is also capable of identifying polymorphic insertion loci by comparing the relative abundance of the same loci in different individuals. PTESHK could be a supplementary technique to detect disease-related polymorphic HERV-K (HML-2) loci and advance research on the contribution of HERV-K (HML-2) to human diseases.

Collection of all known HERV-K (HML-2) loci
A clear knowledge of the integration loci and nucleotide sequences of the HERV-K (HML-2) is paramount to understanding their biological functions. Recent discoveries have catalogued full-length and partial HERV-K (HML-2) proviruses, and LTR elements lacking other HERV-K internal sequences (solo LTRs) in the human genome. In addition to the reference loci found in the GRCh37 (hg19) genome [17], dozens of non-reference (not present in the human reference genome) loci have been identified from whole genome sequencing (WGS) data using different analytic tools [19][20][21]. In total, there are 1063 reported loci whose insertional elements possess functional elements, including 85 reference provirus insertions, 946 reference solo-LTR insertions, 5 non-reference provirus insertions, and 27 non-reference solo-LTR insertions [17][18][19][20][21]. Among them, 69 have been confirmed to be polymorphic insertion (unfixed) loci [17][18][19][20]28] and 89 seem to be fixed in Chinese populations [19]. We collected information on known HERV-K (HML-2) (Additional file 1: Table S1) for comparative analysis with data generated by target enrichment.

PCR-based target enrichment sequencing of HERV-K (HML-2) (PTESHK)
In this study, a PCR-based target enrichment sequencing of HERV-K (HML-2) (PTESHK) has been established to amplify and determine individual HERV-K (HML-2) element insertion sites within genomic DNA. The method takes advantage of target enrichment amplification of HERV-K (HML-2) integration sites and construction of sequencing libraries for obtaining HERV-K (HML-2) individual proviruses or solo LTR sequence linked to insertion site sequences (Fig. 1). Target enrichment amplification of HERV-K (HML-2) is a modification of the genome-wide amplification of proviral sequences (GAPS) technique previously published [25]. There are two modifications of PTESHK over the GAPS method. One is that GAPS focuses on proviral integrations, while PTESHK is ideal for all HERV-K (HML-2) insertions, including proviral HERV-K (HML-2) insertions and solo-LTR insertions. For this purpose, degenerate primers for suppression PCR and nested PCR were redesigned to be specific for the consensus LTR nucleotide sequence, which was generated after an alignment of all reference proviral HERV-K (HML-2) sequences ( Fig. 1, blue solid arrow for the 5′ group, green solid arrow for the 3′ group). The other modification is that PTESHK utilizes NGS, resulting in higher throughput than GAPS. To detect as many HERV-K (HML-2) loci as possible, an enzyme that digests the genome randomly was used to replace the restriction enzyme VspI and the GAPS adapter was optimized for A-T ligation (details shown in Methods and Materials). Furthermore, after the amplification, the PCR products were sequenced using an NGS platform which allowed for a greater number of sequencing reads than could achieved by Sanger sequencing used in the GAPS method.
For transposable elements and retrovirus detection in high-throughput sequencing data, we used the STEAK (Specific Transposable Element Aligner (HERV-K)) tool as the core component for pipeline analysis of our data [27]. We also adjusted STEAK to improve the accuracy of mapping each location and to reduce the number of false-positive loci. As there are two LTR sequences in the host genome for each HERV-K (HML-2) provirus, a number of LTR-gag or LTR-env containing chimeric sequences, in addition to LTR-host genome sequences, will be obtained following the amplification step (generated by primers annotated by the dotted grey arrows in Fig. 1). Flanking sequences were obtained following trimming by STEAK and alignment. Sequences that showed homology to any of the reference HERV-K (HML-2) sequences (mostly gag and env sequences) were discarded, and the remaining fragments were mapped. To ensure the accuracy of the alignment to integration loci, reads smaller than 10 nt were discarded after the removal of the GAPS adapter sequence. For the same purpose, only uniquely mapped reads were extracted for the detection of the HERV-K (HML-2) loci after alignment to hg19 by Novoalign.

Detection capability of PTESHK
In this study, three healthy Chinese males (age range 25 to 29 years; individuals labeled as "P", "W" and "Y") were recruited for the sample collection to establish the utility of the PTESHK method. Three replicates (experimental replicates, denoted "1", "2" and "3") of each individual were performed to evaluate the repeatability of the method.  The workflow is divided into three phases: target enrichment amplification, library construction and bioinformatic analysis. Genomic DNA is randomly enzymatically digested at 30 °C for 5 min to form DNA fragments. Then, after end-repair and A-tailing, a GAPS adapter, to be used for the amplification, is ligated. Suppression PCR is then performed with specific primers targeting the GAPS adapters and 5′/3′ LTR sequences of HERV-K. Nested PCR with inner primers is then used to increase the amplification specificity. After amplification, PCR products of both the 5′ and 3′ ends are mixed together and cleaned with beads. Another end-repair and A-tailing is performed for the ligation of Illumina sequencing adapters, followed by size-selection. Then the library is ready for sequencing on the HiSeq X Ten platform. After sequencing, all sequencing reads are filtered by Trimmomatic to remove low-quality reads, followed by the discarding of PCR duplicates using Picard. Then paired-end reads are merged based on the overlap before filtering the chimeric reads, including LTR and flanking sequences, with STEAK. After acquisition of the trimmed flanking sequence, the GAPS adapter is cut by Cutadapt, and then those reads that mapped to HERV-K sequences are abandoned after the alignment. Last, the reference loci and non-reference loci are catalogued with help of a BED file. Then analysis is done to assess the method. In addition, a schematic of all the primers specific for HERV-K (HML-2) sequence used for NGS library construction was drawn According to all 1063 reference loci [17,[19][20][21], any locus with at least one single read mapping within 10 bp of an original reference locus was considered as a detected reference locus. In all 9 samples, 943 out of 1063 reference loci were detected (Fig. 2a). To determine the number of HERV-K loci identified by PTESHK with increasing read counts, we obtained increasing numbers of reads from raw sequence data of different samples and identified the number of HERV-K (HML-2) loci. All HERV-K (HML-2) integration loci with at least a single mapped read are summarized in Fig. 2b. All 9 samples, comprised of three repeat samples from each of the three individuals, exhibited a similar trend of variation, with a peak value of about 910 detectable loci plateauing when the number of reads reached approximately 15 million.
We also discovered several novel loci using PTESHK. Following the initial analysis of our samples, many reads were found to map to sites located more than 2 kb away from the known reference loci. These loci were therefore classified as novel loci. Integrative Genomics Viewer (IGV) was then used to visualize and localize the position and determine the genomic environment of the region. Thirty solo-LTR insertion loci with a relatively high detection rate were found to be novel reference loci whose sequences could be found in hg19 but have previously not been reported (Table 2). In addition, we found 5 non-reference loci shown by the target duplicated sequence (Additional file 2: Fig. S1) whose integration sites were all within repeat elements. Specific primers were designed and used to verify these 5 loci. One of the novel loci, located on chromosome 6 (denoted as N6p21.32), could be amplified using nested PCR and had a 962 bp solo-LTR (LTR5_Hs) insertion (GenBank accession number: MN325700). Unfortunately, the remaining four novel loci were found within repeat elements and could not be amplified using nested PCR. The location within repeat elements and the low detection abundance of the 4 loci compared with N6p21.32 (visualized by IGV in Additional file 2: Fig. S1) may be the reason for our difficulties in verifying their location. In total, 35 novel loci have been discovered from examination of only three individuals, suggesting there might be additional HERV-K (HML-2) proviruses and solo LTRs that remain undetected.   Identification and characterization of HERV-K (HML-2) loci in the human genome using PTESHK. "P", "W" and "Y" indicate the three different individuals, each with three replicates (marked as "1", "2", and "3"). a Venn diagram exhibiting all the HERV-K (HML-2) detected with at least one single read mapped in this study. Combining the 1063 reference loci and 35 novel loci detected in this study, there were in total 1098 loci. PTESHK detected 943 reference loci and 35 novel loci. In each individual, over 900 loci could be detected in at least one repeat (957 of "P", 951 of "W", and 958 of "Y"). b Linear graph, exhibiting the number of detected HERV-K (HML-2) loci with different numbers of raw reads extracted, shows a similar variation trend for all samples. With 15 million raw reads, PTESHK was able to detect about 800 loci in every sample. c Genome-wide distribution of HERV-K (HML-2) loci. There were 943 detected reference loci, 120 undetected reference loci, and 35 novel loci. The detected reference loci were subdivided into near gene loci, polymorphic integrated loci, and polymorphic integrated near gene loci. Novel loci could be divided into novel reference loci and novel non-reference loci. Different groups are highlighted with different symbols in the figure. The distribution of these HERV-K (HML-2) loci shows that HERV-K (HML-2), including the 35 newly detected loci, tends to be located in gene rich regions (some highlighted by red arrows) Using PTESHK for analysis of 3 different individuals with 3 replicates of each led to the identification of 978 (out of 1098) HERV-K (HML-2) loci (Fig. 2a). Both the 978 detected loci and 120 undetected loci are summarized by their chromosome coordinates in Fig. 2c. In addition to the number of HERV-K (HML-2) proviral or solo LTR elements in each chromosome being significantly related to their gene content [17], the distribution of the HERV-K (HML-2) was also concentrated in gene-rich regions (Fig. 2c, red arrows) rather than in gene-poor regions. Among the 1063 reference loci, 348 loci (near gene loci) are located in or within 2 kb of genes [16]. Most of the 35 newly detected loci in our present study were also distributed in regions with high gene density. Remarkably, 8 of the 35 novel loci were located within introns of protein-coding genes while 9 novel loci were located internally or adjacent to non-coding transcripts ( Table 2). The non-random distribution of HERV-K (HML-2) in gene-rich regions might be explained in that euchromatic areas are more easily accessible for proviral integration, also demonstrating the potential of HERV-K elements for regulating gene activities.

Evaluation of PTESHK
First, we evaluated the ability of PTESHK to identify HERV-K (HML-2) loci by calculating the detection of fixed HERV-K (HML-2) loci in Chinese individuals. Based on the frequency of HERV-K (HML-2) loci calculated by Wildschutte et al. among large populations, 89 loci were previously shown to be fixed in Chinese individuals with a frequency of 100% [19]. We calculated the number of detected fixed loci according to different numbers of raw reads in Fig. 3a. There are at least 86 fixed loci detected in every sample, and no less than 84 fixed loci detectable when the number of raw reads reached 1 million (Fig. 3a). Therefore, PTESHK shows good ability to detect fixed loci. Within all 9 samples in our study, there were only 5 fixed loci that were difficult to detect. Four reference solo-LTR integrations, including solo-LTR370, solo-LTR785, solo-LTR911, and solo-LTR945, are integrated in repeat elements, L1PA2, AluSx, LTR17, and beta satellite repeats, respectively. The reason these known fixed HERV-K (HML-2) integrations were more difficult to recover than others has been discussed before, which may be caused by the sensitivity of STEAK and can be resolved by lowering the tolerance or allowing for multi-location mapping [27]. A recent study has shown that another locus, located in 3q21.2, has been demonstrated to be a polymorphism [18]. Therefore, we cannot rule out the possibility that the undetectable fixed loci are polymorphic in our samples. Next, we evaluated the repeatability of PTESHK in the total number of detected HERV-K (HML-2) among different repeats. Within the experimental replicates of the same individual, about 880 loci were detectable in all repeat experiments of each individual (872 for "P", 889 for "W" and 886 for "Y") ( Fig. 3c-e). A Chi square test was performed to test the difference of the total number of detected loci in the different samples. Among the experimental replicates of both "W" and "Y", there was no significant difference (p = 0.695 for "W" and p = 0.279 for "Y"). However, there appeared to be a significant difference (p = 0.020 < 0.05) among the 3 repeats of "P", which may have been caused by the high disparity of the raw data (the raw data of "P2" was about 1.7-fold that of "P1"), resulting in additional loci with only a few reads mapped. To normalize the difference among the experimental replicates caused by the difference of raw data volume and to avoid possible false-positive loci, we chose to use "counts per million (CPM)" for measuring loci. Here we set a threshold of CPM ≥ 50 for filtration. The number of detected HERV-K (HML-2) insertion loci of increasing raw sequences exhibiting a similar variation trend after screening is summarized in Additional file 3: Fig. S2. When the raw data volume reached 15 million, there appeared to be little difference in the number of detected loci with the increase of the data volume (Additional file 3: Fig. S2b). Although the number of detected loci was reduced after filtering, 712 loci that were detected in at least one sample remained (Additional file 3: Fig. S2a). More than 560 loci could be detected in all repeats of each individual (566 for "P", 563 for "W" and 567 for "Y") (Additional file 3: Fig. S2c-e). While testing the difference of the number of detectable loci among experimental replicates using the Chi square test, all exhibited no significant difference after CPM filtering (p = 0.600 for "P", p = 0.291 for "W" and p = 0.864 for "Y"). Finally, we evaluated the repeatability of the relative abundance (CPM value) of a same locus in different repeats. While paying attention to CPM values of different loci of the same sample, differences over tens of fold could be observed; for example, the average CPM value of all the samples of solo-LTR946 (11459.45) is about 17-fold of that of solo-LTR185 (664.56). To determine whether this difference is consistent among different experimental replicates, we used CPM to calculate the correlation coefficient of the same loci among different experimental replicates to evaluate the repeatability of PTESHK. The results showed that all Pearson correlation coefficients among the different repeats were higher than 0.92, indicating that the CPMs of the same loci were significantly correlated (Fig. 3f-h). Moreover, the correlation coefficients among different samples of different individuals were also higher than 0.85, which indicates good consistency even among different individuals. If the threshold CPM was set at 50, there also seemed to be significant correlations among different repeats with a Pearson correlation coefficient of about 0.90 (Additional file 3: Fig. S2f-h).
In conclusion, with more than 900 loci detected, PTESHK exhibits good capacity for identifying HERV-K (HML-2) integration loci from a relatively small amount of data (15 million raw reads), although PTESHK has difficulty in recovering loci integrated in repetitive elements. After filtering the CPM value, PTESHK shows excellent repeatability of the number of detected HERV-K (HML-2) loci, and the relative abundance of the same loci in different replicates exhibited good repeatability.

Detection features of PTESHK
Although the broad distribution of CPMs of different loci in the same sample is consistent, we still wanted to know the reason for this difference. Following integration into the host genome, over time HERV-K (HML-2) proviral and solo LTR sequence can independently accumulate unique mutations in each element. As the strategy for the enrichment of HERV-K (HML-2) in this study was based on PCR amplification, variants in the primer regions, resulting in differences in PCR amplification preference, may result in a bias toward more recent insertions in a given sample. Since the number of mutations increases with time, we speculated that the bias correlates to the ages of the loci. Based on LTR sequences, there are three subtypes of HERV-K (HML-2) comprising LTR5_Hs, LTR5A and LTR5B, of which LTR5_Hs is considered to be the most recently integrated family and LTR5B to be the oldest and ancestral family [17]. To investigate the relationship between LTR types (representing the integration ages) and CPMs, we performed a phylogenetic analysis of all 1080 loci (including novel loci found in this study) that possess LTR elements and have been annotated for the type of LTR, the estimated ages of the loci, the ability to be detected by PTESHK, and the CPM value of each locus. All LTRs clustered well into three groups, although LTR5B had elements that clustered in the lineages of LTR5_Hs and LTR5A (Additional file 4: Fig. S3). This phenomenon can be explained by a previous study that clustered the proviral LTRs, and concluded that LTR5B is the oldest ancestor of both LTR5A and LTR5_Hs, and therefore LTR5A and LTR5_Hs arose independently and exclusively from LTR5B [17]. With the annotation of the estimated ages of each locus, our results confirmed that LTR5_Hs performed as the youngest integration group, especially for those loci clustered within the longest internal branches and shortest terminal branches containing the majority of the polymorphic loci. The observation that nearly all the loci were detected with a much higher CPM, except for some polymorphism loci, indicates that the LTR5_Hs subgroup was more readily detected with respect to the capacity of detection and CPM values of the loci. The relationship between the estimated age of HERV-K (HML-2) loci and the average CPM value is shown in Fig. 3b, which indicates that the more recent the integration, the higher the average CPM value. This observation implies that PTESHK has a propensity for detecting recently formed integrations, especially those younger than 15 Mya. In general, it is easier for PTESHK to detect recently integrated loci, especially the LTR5_Hs subgroup, which is more likely to generate viral proteins or serve as regulatory elements for its possession of functional sequences with fewer mutations.

Identification of polymorphic loci
Within the HERV-K (HML-2) family, there is a subset of HERV-K (HML-2) loci that are annotated as polymorphic loci raising attention from researchers. In addition to detecting the genome-wide distribution of HERV-K (HML-2) loci, we were interested in determining if PTESHK is also capable of identifying polymorphic loci between different individuals. To identify polymorphic loci, we used EdgeR to compare the differential loci between different individuals [29]. If the fold change of CPM values of the same locus in different individuals is higher than 20, and p-value less than 10 −10 , we considered the locus as polymorphic. Among the 3 individuals in our study, 14 HERV-K (HML-2) loci were determined to be polymorphic (Fig. 4). Twelve out of the 14 have previously been identified as polymorphic loci. In addition to the 12 known polymorphic loci, we found solo-LTR330 and N6p21.32 to be polymorphic (Fig. 4, black frames). Using specific primers for each polymorphic locus, 12 of the 14 loci were verified by nested-PCR. The exceptions were 12p12d and 6p21.32a, which may be caused by either their location in a repeat element, or they may represent a provirus integration in which the length of the products was too long to amplify (Additional file 5: Table S2, Additional file 6: Fig. S4).
There are 69 annotated polymorphic loci, of which 49 (71%) are detectable in this study. Besides the 12 HERV-K (HML-2) polymorphic loci, there are 37 annotated polymorphic loci denoted as nonpolymorphic and 20 denoted as undetectable loci. As the frequency of 62 out of 69 polymorphic loci have been annotated in Chinese or East Asian cohorts [18,19] (Additional file 7: Table S3), we further evaluated the ability of PTESHK for detecting polymorphic loci by their frequency. Although there were 20 known polymorphic loci undetected in this study, we hypothesized that, rather than being undetected by PTESHK, integration at these loci was not present in our individuals. A lack of integration of these loci in Asian cohorts has been shown previously. The frequency of the 20 undetected loci is close to zero. The exception is the 12q12 locus, with a low frequency of about 0.3. There are 18 polymorphic HERV-K (HML-2) loci fixed or nearly fixed in Chinese cohorts, based on previous studies. The frequency of these loci in our 3 individuals is consistent with the observations of previous studies. For the 12 polymorphic loci detected in this study, the frequency of 8 loci is similar to that of the earlier reports. Four loci, including 6p21.32b, 19p12b, solo-LTR651 and solo-LTR585, show a relatively higher frequency in our study, which may be caused by the limitation of the number of samples. For the same reason, some loci annotated with low frequency in Asians were indicated to be fixed in our study. In general, PTESHK performs well in identifying polymorphic loci, and may be used for the detection of disease-related polymorphic HERV-K (HML-2) loci or polymorphisms in the general population of different ethnic groups.

Discussion
HERVs and notably HERV-K insertional elements fall into two main categories: i) complete or nearly complete proviral insertions capable of synthesizing proviral RNA and in some cases producing functional viral proteins; and ii) LTR insertional elements that have the capacity for regulating expression of genes within close proximity. Integrated HERV-K proviruses and solo LTR elements have been implicated in a variety of pathological human diseases, including autoimmunity, cancer, and neurological syndromes [30][31][32][33][34]. Furthermore, HERV-K proviruses and solo LTRs have been shown to be important in early development [35]. Even though a large body of work has been directed at understanding the roles of HERV-K in development and disease, a complete understanding awaits a full description of HERV-K proviruses, solo LTRs, and their genomic locations. A significant problem in compiling a full list of HERV-K proviruses and solo LTRs and their genomic positions is the repetitive nature of these elements and the difficulty in assigning positions from next generation sequencing data [36]. In an attempt to generate a more complete catalogue of polymorphic and non-polymorphic HERV-K elements across ethnic groups, we modified and extended previous HERV-K discovery platforms (GAPS) in building a PCR-based target enrichment sequencing protocol for HERV-K (HML-2) (PTESHK), which can be widely applied different kinds of starting materials in limited amounts (100-150 ng genomic DNA). In addition to detecting reference loci with high efficiency, it is also capable of discovering novel HERV-K (HML-2) loci. Compared with other techniques, PTESHK is a robust tool with several advantages for detecting HERV-K (HML-2) insertions. First, in addition to detecting proviruses integration loci [25,26], PTESHK is also able to detect the solo-LTR integration of HERV-K (HML-2). Second, with high-throughput sequencing, hundreds of loci can be detected simultaneously. Within the 3 individuals in this study, 978 loci could be detected in at least one sample with no less than a single read mapped. PTESHK possesses a bias for the detection of the recently integrated LTR5_Hs group, which contains most of the polymorphic loci, indicative of the ability of PTESHK to detect polymorphic loci. Remarkably, 14 polymorphic loci have been identified in just 3 individuals in our study. Third, it is highly economical in that only about 15 million raw reads (about 5.4 Gb) are sufficient for characterization as compared to those studies based on WGS data with high sequencing depth (whose data volume requires around 90 Gb) [18,19,26]. Moreover, the analysis pipeline runs faster because of the lower volume of sequencing data. Fourth, enrichment of target sequences by PCR amplification is much easier to perform than using a biotin-streptavidin-based bead with DNA bait probes [16,27]. Therefore, PTESHK could be helpful for determining the distribution of HERV-K (HML-2) in the human genome because hundreds of loci can be detected simultaneously with fewer raw data and excellent repeatability. Although we focused on HERV-K (HML-2), the protocol described here could be easily adapted to study different families of HERVs and possibly other retrotransposons. The application of PTESHK among different cohorts might identify polymorphic loci in specific states, for example, in different populations and within different disease types. Although PTESHK is limited in subdivided proviruses and solo-LTR HERV-K (HML-2) loci, an optimization of the suppression PCR primers could work for detecting only proviruses HERV-K (HML-2) loci [26].
In this study, we used three biological repeats and each has three experimental repeats to calculate and evaluate the repeatability of PTESHK. As shown in Fig. 3c-e, dozens of loci (ranging from 26 to 50) will be missed if studying one individual with one sample without repeats. All these loci show a similarity in that they can only be mapped with a few reads even when they are detectable. They can be clustered into two conditions. One condition is that low number of reads identified for integration groups, LTR5A and LTR5B, could be the result of difficulties in amplifying these regions because of mutations in the primer region. The other condition are those loci within repetitive elements that are more difficult to recover because of the artificial filtration of non-specific aligned reads. However, the HERV-K (HML-2) loci clustered into group LTR5_Hs, which showed high amplification efficiency, and were not integrated into repeat elements were affected less. In addition, the phenomenon of false-negative loci can be improved by increasing the data volume.
The discovery of novel loci is important for supplementing the global distribution of HERV-K (HML-2) in the reference genome. In this study, within 3 healthy individuals, 35 novel loci have been detected, including 5 non-reference loci, and 8 out of 35 loci are intronic integrations. Based on the possible regulatory functions of HERV-K (HML-2) LTRs, novel loci located in the vicinity of protein coding genes may be candidates for regulating host biological activities. The N15q21.2 locus is located in the intron of GLDN, a gene that encodes gliomedin involved in the formation of the nodes of Ranvier and the development of the human peripheral nervous system. As mutations in GLDN are responsible for lethal arthrogryposis [37], a polymorphic integration of HERV-K (HML-2) in GLDN could influence the expression of GLDN and lead to disease. Another novel locus termed N3q22.1 is located in the intron of COL6A5, which encodes the α5 chain of the important extracellular matrix protein collagen VI. The link between COL6A5 and atopic dermatitis is controversial [38,39], but a mechanism involving HERV-K (HML-2) integration, such as alternative splicing or the generation of antisense RNA, might resolve this conflict. In addition, newly reported reference loci solo-7p14.3, solo-8p12, solo-11q12.2, solo-12q23.3 and solo-19q13.41 are also located within the introns of genes. Remarkably, two interesting novel loci detected in this study highlight the potential function of LTRs on alternative splicing. Solo-8p23.1c is located in the exon of FAM85B and provides a polyadenylation signal for transcript termination. So does solo-4p16.3, located in BC042823, a non-coding transcript that is not only terminated by the HERV-K (HML-2) LTR, but also initiated by an ERV1 LTR. Whether these variant transcripts influence the onset of disease remains unknown. In general, the discovery of novel loci offers new vantage points for the study of the pathogenesis of diseases.
Our ultimate purpose behind developing PTESHK is the identification of genome-wide distribution of HERV-K (HML-2) in the human genome to distinguish polymorphic integrations among different individuals or cohorts. PTESHK shows good detection ability for all the 69 previously denoted polymorphic loci with high consistency in the frequency, especially those fixed or absent in Chinese. For the 12 reported polymorphic loci also identified in the present study, their frequency is consistent with the observations of previous studies, except some with a higher frequency which may be caused by the limitation of the sample number. The ability to identify the polymorphisms in 19p12b (HERV-K113) and 1p13.2, unfixed loci that have been declared to be correlated with diseases, suggests PTESHK be applied in the detection of disease-related polymorphic loci. We also identified two new polymorphic loci (solo-LTR330 and N6p21.32) among the three individuals. The solo-LTR330 is located in the flanking sequence of exon 4 of HLA-DQB1, a gene encoding the beta 1 subunit of the human leukocyte antigen (HLA)-DQ surface receptor. HLA-DQB1 is one of the high-risk candidate genes for several diseases, such as type 1 diabetes [40] and breast cancer [41]. In accordance with previously reported bidirectional promoter activity of HERV-K (HML-2) LTRs [42], the potential presence of antisense transcripts of HLA-DQB1 initiated by solo-LTR330 might down-regulate the expression level of HLA-DQB1 transcript and modulate the transcription profile. This may provide new insight for how HLA-DQB1 is a hot-spot for association with several diseases. Interestingly, we found that the pre-integration (absent) loci are always mapped by a few reads instead of no read, which results in an aberrant fold change between present and absent loci. Hence, there are some points near the cut-off lines (Fig. 4). There could be two primary reasons. First, the heterogeneity of a locus in one individual may result in an aberrant fold change. Furthermore, as some of the flanking host genome sequences exhibit high similarity, when mapping the sequencing reads to the reference, there might be false alignments. This results in a few reads mapped to the absent loci and the fold change will not be infinite. We therefore created and set a filter of fold change > 20 and p-value < 10 −10 after testing for several scenarios for polymorphic. In this study, we detected 14 polymorphic loci. With PCR verification, 12 loci can be demonstrated with two exceptions where the loci are found within repetitive elements. Verification by PCR helps in supporting the identification of polymorphic loci detection using PTESHK.

Conclusions
We developed PCR-based target enrichment sequencing of HERV-K (HML-2), PTESHK, to amplify chimeric LTR-host genome sequences followed by next-generation sequencing to assess the genome-wide distribution of HERV-K (HML-2) in the human genome. With this method, hundreds of HERV-K (HML-2) integrations, including both known reference loci and novel loci, could be identified with good repeatability. In addition to the ability to characterize the distribution of HERV-K (HML-2) in the human genome, PTESHK performs well for the detection of polymorphic loci that might play paramount roles in the causation of disease.

DNA samples and extraction
Whole blood (150-200 μl) was collected from three unrelated human volunteers (males, age ranges from 25 to 29, denoted "P", "W" and "Y") from Shantou, China. Up to 3 different samples (denoted "1", "2" and "3") from each individual were prepared for genomic DNA extraction using a TIANamp Genomic DNA Kit (TIANGEN) following the manufacturer's protocol. The genomic DNA extracted from 3 blood drawings of one individual were mixed as a pool and used as a template for novel loci PCR validation. Informed consent was obtained at the time of collection.

Target enrichment amplification and NGS library construction
After optimization of the amplification reactions and library construction steps (data not shown), the detailed procedures are as follows. Target enrichment amplification was a modification of the GAPS method [25] and performed using a KAPA HyperPlus Kit (Roche), and all reactions were prepared on ice. Genomic DNA (100-150 ng) was digested randomly with KAPA Frag enzyme at 30 °C for 5 min using a pre-cooled thermocycler. After enzymatic fragmentation was performed, the fragments were subjected to end repair and A-tailing and incubated at 65 °C for 30 min. Then the GAPS linker (15 μM) was ligated to the fragments at 20 °C for 15 min according to the instructions of the manufacturer. The GAPS linker was prepared by mixing an equal volume of KNG-A1 and KNG-A2 (Additional file 8: Table S4) and annealing at 65 °C for 20 min followed by cooling to room temperature at a rate of 1 °C every 15 s. After the ligation, a post-ligation cleanup was performed immediately with AMPure XP beads (Beckman) to remove unligated linker or linker-dimer molecules.
Suppression PCR was carried out using Premix Taq ™ (Takara) in a final volume of 20 μl containing 3 ng of ligated genomic DNA and 1.25 μM of each primer. One of the primers was designed based on the linker sequence (RBX4), and the other on the HERV-K LTR sequence (consensus sequence after an alignment of all the reported HERV-K (HML-2) provirus sequences) (5LTR1 for 5′ LTR end and 3LTR1 for 3′LTR end) (Fig. 1, Additional file 8: Table S4). PCR was performed as follows: 96 °C for 1 min; 25 cycles of 96 °C for 30 s, 60 °C for 2 min, and a final extension step at 72 °C for 10 min. Nested PCR under the same conditions was then performed, with a 1:50 dilution of PCR products, with inner primers specific to the sequence of the primary PCR product (5LTR2 for 5′ LTR, 3LTR2 for 3′ LTR and RBY1 for the linker, Additional file 8: Table S4) in a final volume of 50 μl.
After the secondary PCR, the products of both 5′LTR and 3′LTR from the same sample were combined to form a 100 μl mix for a post-amplification cleanup with AMPure XP beads (Beckman) to get a 50 μl eluate of the products. The products were subjected to an end repair and A-tailing step as described above, followed by a ligation of the specific sequencing adapter (SeqCap Adapter Kit, Roche) for Illumina sequencing. After the ligation, a post-ligation cleanup and a size selection step were performed immediately for the selection of library molecules (inclusive of adapter) in the range of 250-450 bp and readied for HiSeq X Ten PE150 sequencing. To avoid errors from sequencing, all libraries were performed in one lane. The raw data of each library ranged from 5G to 10G.

Bioinformatics analysis
Paired-end reads were trimmed using the Trimmomatic program to remove low-quality reads [43]. Picard Mark-Duplicates (http://broad insti tute.githu b.io/picar d) was used to filter the PCR duplicates out of the clean data before being merged using FLASH (Fast Length Adjustment of Short Reads) [44]. Both merged reads and single reads were submitted to STEAK [27] for HERV-K mining in the unpaired mode with a bait length of 20 bp (beginning and end of the HERV-K LTR) and a percent identity of 90. The trimmed reads were then treated with Cutadapt [45] to remove the partial GAPS linker sequences and reads shorter than 10 bp after the trim were abandoned. Then the remaining reads were aligned to all HERV-K (HML-2) sequences in GRCh37 (hg19), and all mapped reads were removed by SAMtools [46] to avoid HERV-K sequence disturbance. Novoalign (http://novoc raft.com) was chosen to map reads back to the host genome hg19, and those reads uniquely mapped could be used for the detection of reference or novel HERV-K loci with BEDTools [47]. The list of known integration sites, a reference file for BEDTools, is made up of all 1063 loci that have been previously reported and is summarized in Additional file 1: Table S1. Only those fragments mapping within 10 nt of the reference loci were calculated. IGV [48] could be used for visualization and confirmation of the results.

Statistical analysis of the method
All loci within the 1098 loci that had at least one read mapped were accepted as positive loci. Different numbers of reads (100,000, 500,000, 1,000,000, 2,000,000, 3,000,000, 5,000,000, 7,000,000, 10,000,000 and 15,000,000) were extracted from the raw data of different samples (P1, P2, P3, W1, W2, W3, Y1, Y2 and Y3) for bioinformatics analysis. The value of the CPM of every detected locus was calculated to represent the relative abundance of the locus, and it was used for the evaluation of the difference of one locus among different samples.
Statistical analysis was conducted using R software (version 3.5.0). The distribution of HERV-K (HML-2) integration loci in the hg19 genome was exhibited by RIdeogram. The Chi square test was used to determine whether there was a significant difference among the number of HERV-K loci detected in different replicates of one individual. The Pearson correlation coefficient was used to compare the correlation of CPM of the same loci among different replicates, and the results were visualized by Performance Analytics. EdgeR was used to compare the differential loci among different individuals to determine polymorphic or non-polymorphic loci. Changes of the CPM value of the loci considered as polymorphic between two individuals were 20-fold or greater with a p-value ≤ 10 −10 .

New loci validation and sequencing
Primers specific for the novel non-reference loci detected by the NGS data were designed for PCR validation. PCR was carried out using Premix Taq ™ (Takara) following the manufacture's protocol. PCR products were separated on a 1.5% agarose gel, and then purified with a Universal DNA Purification kit (TIANGEN). The purified PCR products were directly ligated into the pEASY-T3 Cloning Vector (TransGen Biotech), then transformed into competent E. coli. Positive recombinants were identified by blue-white color selection and sent for Sanger sequencing.

Phylogenetic analysis
The nucleotide sequences of all 1080 HERV-K (HML-2) loci that contained an LTR, including reference loci and novel loci, were selected for the construction of a neighbor-joining tree. For those provirus insertions with two LTR elements, the 5′ LTR sequence was chosen first and if it was 5′ truncated, then the 3′ LTR sequence was used. All sequences were downloaded from NCBI (National Center for Biotechnology Information, https ://www.ncbi. nlm.nih.gov/). Nucleotide sequences were aligned using ClustalW [49] and then used to generate the tree with MEGA7 [50] using 5000 bootstraps and the pair-wise deletion option. The NJ tree was visualized by the online tool iTOL (Interactive Tree Of Life, http://itol.embl.de/) together with the annotation information obtained by early statistics analysis.