Expression pattern analysis of transcribed HERV sequences is complicated by ex vivo recombination

Background: The human genome comprises numerous human endogenous retroviruses (HERVs) that formed millions of years ago in ancestral species. A number of loci of the HERV-K(HML-2) family are evolutionarily much younger. A recent study suggested an infectious HERV-K(HML-2) variant in humans and other primates. Isolating such a variant from human individuals would be a significant finding for human biology.
Results: When investigating expression patterns of specific HML-2 proviruses we encountered HERV-K(HML-2) cDNA sequences without proviral homologues in the human genome, named HERV-KX, that could very well support recently suggested infectious HML-2 variants. However, detailed sequence analysis, using the software RECCO, suggested that HERV-KX sequences were produced by recombination, possibly arising ex vivo, between transcripts from different HML-2 proviral loci.
Conclusion: As RT-PCR probably will be instrumental for isolating an infectious HERV-K(HML-2) variant, generation of "new" HERV-K(HML-2) sequences by ex vivo recombination seems inevitable. Further complicated by an unknown amount of allelic sequence variation in HERV-K(HML-2) proviruses, newly identified HERV-K(HML-2) variants should be interpreted very cautiously.

tion cost decreases, more and more recombination events are introduced in the explanation in favor of mutation events. The first recombination event reduces the number of mutations needed for an explanation by the largest factor. RECCO builds a list of recombination events and displays the amount of mutation cost saved by each recombination, the so-called "savings" of a recombination. RECCO also computes the total mutation cost of the explanation that includes this recombination event (see Table 2). True recombinant sequences usually display a strong reduction in mutation cost (i.e. a high savings value) for the first few recombination events introduced.
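The relation between total mutation cost and savings can be illustrated with a minimal sketch. It assumes that the total mutation cost of the explanation is known for 0, 1, 2, ... recombination events; the function name and the numbers are hypothetical and are not taken from RECCO or from Table 2.

```python
def recombination_savings(total_costs):
    """Return the savings of each successive recombination event.

    total_costs[k] is assumed to be the total mutation cost of the
    explanation that uses k recombination events (hypothetical input).
    """
    return [total_costs[k - 1] - total_costs[k]
            for k in range(1, len(total_costs))]


# Hypothetical mutation costs with 0, 1, 2 and 3 recombination events.
costs = [40, 18, 12, 11]
savings = recombination_savings(costs)
print(savings)
```

In this made-up example the first recombination event saves far more mutation cost than the later ones, which is the pattern described above for true recombinant sequences.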
To quantify the statistical significance of each recombination event, RECCO generates sets of alignments by permuting the columns of the alignment. As a result, the permuted alignments do not contain any recombination signal, but have the same diversity as the original alignment. P-values are then estimated by computing the probability of obtaining higher savings than observed in the given alignment purely by chance, based on the analysis of the set of permuted alignments. We report the p-values for the query sequence here, as our goal was to analyze the recombination signal for the query sequence only.
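The permutation scheme can be sketched as follows. This is an illustration under stated assumptions, not RECCO's implementation: `toy_savings` is a hypothetical stand-in statistic (the mutation cost saved by allowing a single breakpoint between two candidate parent sequences), and the add-one correction in the p-value estimate is our assumption, used only to avoid reporting a p-value of exactly zero.

```python
import random


def permute_columns(alignment, rng):
    """Shuffle alignment columns; every sequence receives the same column order."""
    order = list(range(len(alignment[0])))
    rng.shuffle(order)
    return ["".join(seq[i] for i in order) for seq in alignment]


def toy_savings(alignment):
    """Hypothetical statistic: mutations saved by one breakpoint (query, parent a, parent b)."""
    query, a, b = alignment
    da = [q != x for q, x in zip(query, a)]  # mismatches against parent a
    db = [q != x for q, x in zip(query, b)]  # mismatches against parent b
    no_rec = min(sum(da), sum(db))
    n = len(query)
    # k = 0 and k = n correspond to using a single parent, so savings >= 0.
    one_rec = min(min(sum(da[:k]) + sum(db[k:]), sum(db[:k]) + sum(da[k:]))
                  for k in range(n + 1))
    return no_rec - one_rec


def permutation_p_value(alignment, savings_fn, n_perm=1000, seed=0):
    """Estimate P(savings in a permuted alignment > observed savings)."""
    rng = random.Random(seed)
    observed = savings_fn(alignment)
    higher = sum(1 for _ in range(n_perm)
                 if savings_fn(permute_columns(alignment, rng)) > observed)
    return (higher + 1) / (n_perm + 1)  # add-one correction (assumption)


alignment = ["AAAAAAAAGGGGGGGG",   # query
             "AAAAAAAACCCCCCCC",   # parent a: matches the first half
             "TTTTTTTTGGGGGGGG"]   # parent b: matches the second half
p = permutation_p_value(alignment, toy_savings, n_perm=200, seed=1)
```

Because whole columns are shuffled, each permuted alignment keeps the same per-column diversity as the original, while the positional clustering of informative sites, and with it the recombination signal, is lost.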

Treatment of gaps in RECCO analysis
Treating gaps correctly was critical for the analysis of the HERV-KX sequences, as the multiple alignment contained two long gaps and several small gaps. Recently published recombination detection methods usually implement one of the following three options: (i) discard sites that contain a gap character, (ii) treat each gap character as a fifth nucleotide state or (iii) treat each consecutive run of columns containing gaps as a large polymorphism (Geneconv [2]). In our case, the first option results in an unacceptable loss of information. For example, the 96 bp indel differentiates between evolutionarily young and old HERV-KX sequences [3]. The second option may lead to an artificially high similarity or dissimilarity between sequences in gap regions and may eventually produce spurious recombination events. The third option prohibits recombinations in any run of columns containing a gap, such that a sequence containing a long gap may confound recombinations that involve other sequences. It is also difficult to choose an adequate scoring term for the resulting large polymorphisms. In conclusion, all existing approaches for treating gaps either discard a lot of information and thus miss recombination events, or may infer spurious recombination events based solely on gap information.
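The effect of options (i) and (iii) on an alignment can be made concrete with a small sketch. The toy alignment and the function names are hypothetical; the code only shows which columns would be discarded under option (i) and which column runs would be collapsed into single large polymorphisms under option (iii).

```python
def gap_columns(alignment):
    """Indices of columns in which at least one sequence has a gap character."""
    return [i for i in range(len(alignment[0]))
            if any(seq[i] == "-" for seq in alignment)]


def strip_gap_columns(alignment):
    """Option (i): discard every column that contains a gap."""
    drop = set(gap_columns(alignment))
    return ["".join(c for i, c in enumerate(seq) if i not in drop)
            for seq in alignment]


def gap_runs(alignment):
    """Option (iii): consecutive runs of gapped columns as half-open (start, end) pairs."""
    runs = []
    for i in gap_columns(alignment):
        if runs and runs[-1][1] == i:
            runs[-1] = (runs[-1][0], i + 1)  # extend the current run
        else:
            runs.append((i, i + 1))          # start a new run
    return runs


toy = ["ACGTACGTAC",
       "AC--ACGTAC",
       "ACGTAC-TAC"]
stripped = strip_gap_columns(toy)
runs = gap_runs(toy)
```

In this toy case option (i) removes three of ten columns and leaves three identical sequences, so the information carried by the indels is lost entirely; under option (iii) no recombination could be placed inside either of the two runs.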
We decided to implement an approach that discriminates between possibly spurious recombination events based on gap information and recombination events based on polymorphisms.
It is important to realize that gaps in the query sequence differ in nature from gaps in the sequences used for an explanation. If gaps in the query sequence are matched with nucleotides in the explanation, the corresponding part of the explaining sequence represents irrelevant information. Hence, gaps in the query sequence are assigned zero cost, such that all columns with a gap in the putative recombinant are effectively removed from the alignment. The situation is different if nucleotides in the query sequence are matched with a gap in the explanation. In this case, information is missing, as the query sequence is not fully explained by the other sequences. Consequently, we have chosen to penalize gaps in the explanation.
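The asymmetric treatment of gaps can be sketched as a per-column cost function. The numerical values of `MISMATCH_COST` and `GAP_PENALTY` are assumptions chosen for illustration; the text above does not state the values used in RECCO.

```python
GAP = "-"
MISMATCH_COST = 1.0  # assumed cost of one mutation
GAP_PENALTY = 2.0    # assumed penalty for a gap in the explanation


def column_cost(query_char, explanation_char):
    """Cost of explaining one query position with one explanation position."""
    if query_char == GAP:
        return 0.0            # gap in the query: column is effectively ignored
    if explanation_char == GAP:
        return GAP_PENALTY    # nucleotide in the query left unexplained
    return 0.0 if query_char == explanation_char else MISMATCH_COST


def explanation_cost(query, explanation):
    """Total cost of explaining the query with a single aligned sequence."""
    return sum(column_cost(q, e) for q, e in zip(query, explanation))


cost_forward = explanation_cost("AC--GT", "ACTTG-")   # one gap in the explanation
cost_reverse = explanation_cost("ACTTG-", "AC--GT")   # two gaps in the explanation
```

Swapping the roles of the two sequences changes the cost, which reflects the asymmetry described above: only gaps on the explanation side are penalized.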