Origin and recent expansion of an endogenous gammaretroviral lineage in canids

Mammalian genomes contain a fossilized record of ancient retroviral infections in the form of endogenous retroviruses (ERVs). We used whole genome sequence data to assess the origin and evolution of the recently active ERV-Fc gammaretroviral lineage based on the record of past infections retained in the genome of the domestic dog, Canis lupus familiaris. We identified 165 loci, including 58 insertions absent from the dog reference assembly, and characterized element polymorphism across 332 canids from nine species. Insertions were found throughout the dog genome including within and near gene models. Analysis of 19 proviral sequences identified shared disruptive mutations indicating defective proviruses were spread via complementation. The patterns of ERV polymorphism and sequence variation indicate multiple circulating viruses infected canid ancestors within the last 20 million to within 1.6 million years with a recent bust of germline invasion in the lineage leading to wolves and dogs.


Introduction 40
During a retroviral infection, the viral genome is reverse transcribed and the resulting DNA is then 41 integrated into the host genome as a provirus. In principle, the provirus carries all requirements 42 necessary for its replication, and typically consists of an internal region encoding the viral genes 43 (gag, pro/pol, and env) flanked by two regulatory long terminal repeats (LTRs) that are identical 44 at the time of integration. Outermost flanking the provirus are short, 4-6 bp target site duplications 45 (TSDs) of host genomic sequence generated during integration. Infection of such a virus within a 46 germ cell or germ tissue may lead to an integrant that is transmitted vertically to offspring as an 47 endogenous retrovirus (ERV). Over time, the ERV may reach detectable frequencies within a 48 population or even fixation within a species (Boeke and Stoye, 1997). Through repeated germline 49 invasion and expansion over millions of years, ERVs have accumulated to such an extent that 50 they account for considerable proportions of genetic sequence of many mammalian genomes. 51 ERVs have been commonly referred to as 'fossils' of their once-infectious counterparts, 52 providing a record of exogenous retroviruses that previously infected a species (Boeke and Stoye, 53 1997). Because an ERV switches from a relatively rapid evolutionary state as an infectious virus 54 to a relatively slow one while replicated as part of the host genome, recently formed ERVs tend 55 to bear close resemblance to their exogenous equivalent and possess a greater potential to retain 56 functional properties. Across species, the majority of ERVs are thought to provide no advantage 57 to the host, and have, for the most part, been progressively degenerated over time due to 58 accumulated mutations or from recombination between the proviral LTRs that replaces the full-59 length sequence with a solitary LTR, or 'solo LTR' (Boeke and Stoye, 1997). However, increasing 60 evidence suggests evolutionary roles in host physiology via gene regulation, for example by 61 providing alternative promoters, enhancers, splice sites, or termination signals (Chuong et al., 62 4 integrants resulting from relatively recent endogenization events that segregate as unfixed alleles 74 within the species. Examples include the cervid gammaretrovirus in mule deer populations of 75 North America (Elleder et al., 2012) and the insertionally polymorphic ERVs found in domestic 76 and wild cats (Roca et al., 2004, Troyer et al., 2004. Other species are hosts to infection from 77 exogenous viruses that have been shown to contribute to new germline infections, for example, 78 the Koala retrovirus (KoRV) is in the midst of transitioning to an endogenous state in Australia 79 (Ishida et al., 2015, Tarlinton et al., 2006, Lober et al., 2018. Recombination between distinct 80 ERV RNAs that are co-packaged in the same virion may also contribute to new viruses that have 81 altered pathogenic properties (Stocking and Kozak, 2008). 82 The endogenous retroviruses classified as ERV-Fc are distant relatives of extant 83 gammaretroviruses (also referred to as gamma-like, or g-like). As is typical of most ERV groups, 84 ERV-Fc is named for its use of a primer binding site complementary to the tRNA used during 85 reverse transcription (tRNA phe ). Previous analysis of the pol gene showed that ERV-Fc elements 86 form a monophyletic clade with the human g-like ERV groups HERV-H and HERV-W (Jern 2005). 87 As is common to all g-like representatives, members of the ERV-Fc group possess a 'simple' 88 genome that encodes the canonical viral genes and lacks apparent accessory genes that are 89 present among complex retroviruses. The ERV-Fc group was first characterized as a putatively 90 extinct, low copy number lineage that first infected the ancestor of all simians and later contributed 91 to independent germline invasions in primate lineages (Benit et al., 2003). It has since been shown 92 that ERV-Fc related lineages were infecting mammalian ancestors as early as 30 million years 93 ago and subsequently circulated and spread to a diverse range of hosts, including carnivores, 94 rodents, and primates (Diehl et al., 2016). The spread of the ERV-Fc lineage included numerous 95 instances of cross-species jumps and recombination events between different viral lineages, now 96 preserved in the fossil record of their respective host genomes (Diehl et al., 2016). 97 In comparison to humans and other mammals, the dog (Canis lupus familiaris) displays a 98 substantially lower ERV presence, with only 0.15% of the genome recognizably of retroviral origin 99  (Figure 1). Thus, the analysis provided a broad timeline to reconstruct the 281 evolutionary history of this ERV lineage ranging from host divergences within the last tens of 282 thousands of years (gray wolves) to several millions of years (true foxes). 283 In total, we in silico genotyped 145 insertions (89 reference and 56 non-reference loci) 284 across 332 genomes of canines and wild canids (refer to Methods; Table S4). To more accurately 285 facilitate the identification of putative population-specific CfERV-Fc1(a), and to distinguish 286 possible dog-specific insertions that may have occurred since domestication, wolves with 287 considerable dog ancestry were removed from subsequent analyses (see Methods). Alleles 288 corresponding to reference (i.e., CanFam3.1) and alternate loci were recreated based on the 289 sequence flanking each insertion while accounting for TSD presence. We then inferred genotypes 290 by re-mapping Illumina reads that spanned either recreated allele for each site per sample. 291 Reference insertions were deemed suitable for genotyping only if matched TSDs were present 292 with clear 5' and 3' LTR junctions. We excluded the two non-reference sites with only a single 293 assembled LTR junction due to uncertainty of both breakpoints (above). To facilitate genotyping 294 of the eight unresolved assemblies with linked 5' and 3' LTR junctions, we supplemented the 295 Repbase CfERVF1_LTR consensus sequence over the missing region (lower case in Table S2). 296 As has been discussed in earlier work (Wildschutte et al., 2016), this genotyping approach is 297 limited by the inability of single reads to span the LTR; therefore, the data do not discriminate 298 between the presence of a solo LTR from that of a provirus at a given locus. 299 Insertion allele frequencies ranged from 0.14% (inferred single insertion allele) to fixed 300 across samples ( Figure 5; all raw data is included in Table S5). The rarest insertions were found 301 in gray wolves, the majority of which were also present in at least one village or breed dog (for 302 example, see chr13:16,157,778 and chr15:32,084,977 in Figure 5). All non-reference insertions 303 were variably present in Canis species, and only few had read support in outgroup species (i.e. 304 foxes, dhole). Notably, there was no evidence for the presence of any loci specific to village or 305 breed dogs. Of outgroup canids, ~33% (48 of 145) insertions were detected in the Andean fox, 306 and ~50% (a total of 73) insertions were present in the dhole. The remaining foxes, representing 307 the most distant splits of extant canids, had the lowest prevalence of occupied loci, with just five 308 insertions found in the gray and Island foxes, respectively. However, this is not unexpected since 309 insertions private to these lineages would not be ascertained in our discovery sample set. 310

11
The relative distribution of proviruses was in general agreement with dating via LTR 311 divergence, though some inconsistencies were observed. No

Evolution of the CfERV-Fc1(a) lineage in Canidae 327
LTR sequences are useful in a phylogenetic analysis for exploring the evolutionary patterns of 328 circulating variants prior to endogenization, as well as following integration within the host. To 329 infer the evolutionary history leading to CfERV-Fc1(a) presence in modern canids, we constructed 330 an LTR tree using as many loci as possible (from 19 proviral elements and 142 solo-LTRs) ( Figure  331 6; Table S6). 332 In broadly comparing LTR placement to our inferred species presence (Figure 6), the 333 longer-branched clusters contained the few ancestral loci present in the outgroups (gray and red 334 foxes) and those that were mostly fixed among the other surveyed species. However, at least two 335 non-reference LTRs and other unfixed insertions were also in these clades, suggesting their more 336 recent formation from related variants therein. One provirus was present within the most basal 337 clade, and four (including the duplicated locus) were present within intermediate clades. We 338 observed a major lineage (upper portion of tree) that included the majority of recent integrants. 339 This lineage gave rise to the greatest number of polymorphic insertions, including a derived clade 340 of insertions that appears to be Canis-specific, with some sites restricted to one or two sub-341 populations. This lineage also contains the majority of proviral LTRs (15 of 19 included in the 342 analysis), most possessing intact pol and/or env genes. The youngest proviral integrants, as 343 inferred from high LTR identities and prevalence among sampled genomes, tend to be on short 344 branches within derived clusters that contain the majority of unfixed loci, likely reflecting their 345 source from a relatively recent burst of activity in Canis ancestors. 346 Within the germline, the highest occurrence of recombination resulting in a solo LTR takes 347 place between identical LTRs (Belshaw et al., 2007, Stankiewicz andLupski, 2002), implying the 348 LTR sequence itself is preserved in the solo form. Under this assumption, the presence of identical 349 solo LTR haplotypes should, in principle, indicate their origin from a common ancestral source. 350 We identified four such LTR haplotypes within the Canis-specific clades, including loci in co-351 clusters with one of two proviruses (chr3:82,194,219 and chr4:22,610,555), therefore bounding 352 the inferred age of these insertions to within the last 1.64 mya (dashed lines in Figure 6). Between 353 the four identical clusters, the LTR haplotypes shared nucleotide identity ranging from 99.3% 354 (three substitutions from a consensus of the four clusters) to 99.7% (one substitution), suggesting 355 their origin from related variants over a common timeframe. We modified our dating method to 356 obtain an estimated time of formation across each cluster by considering the total concatenated 357 LTR length per cluster, as has been similarly employed elsewhere (Ishida et al., 2015). This As an endogenous element, an ERV may retain close resemblance to its exogenous source over 379 long periods of time with recent integrants assumed to possess a greater potential to retain the 380 properties of the infectious progenitor. We combined the eight non-reference proviruses with the 381 eleven reference insertions to generate an updated consensus (referred to here as CfERV-382 Fc1(a) CON ) as an inferred common ancestor of the CfERV-Fc1(a) sublineage. A detailed 383 annotation of the updated consensus is provided in Figure S3 and summarized as follows.

Properties of individual CfERV-Fc1(a) proviruses 410
We assessed the properties of individual full-length elements for signatures of putative function 411 (summarized in Figure 7). With the exception of the gag gene, we identified intact ORFs in several 412 14 reference copies and most of our non-reference sequenced proviruses. A reading frame for the 413 pol gene was present in six proviruses; of these, all contained apparent RT, RnaseH, and 414 integrase domains without any changes that would obviously be alter function. Likewise, an env 415 ORF was present among seven proviruses, of which all but one contained the above mentioned 416 functional domains (the SU-TM cleavage site is disrupted in the chr5:10,128,780 provirus: RRKA). 417 Comparison of the rate of nonsynonymous (d N ) to synonymous (d S ) nucleotide substitutions for 418 the seven intact env reading frames revealed an average d N /d S ratio of 0.525, indicating moderate 419 purifying selection (p = 0.02, Nei-Gojobori method). The hydrophobicity plot of each env ORF was 420 in agreement with that of the CfERVFc1 CON provirus, with predicted segments for a fusion peptide, 421 TM region, and ISD. Comparison to the pol and env translated products that would be predicted 422 from the CfERVFc1 CON inferred the individual proviruses shared 98.4% to 99.3% (Pol) and 98% 423 to 99.6% (Env) amino acid identity, respectively, and each was distinct from the inferred 424 We observed evidence for the mobilization of CfERV-Fc1(a) copies via complementation. 482 For example, examination of the proviral gene regions revealed inherited frameshift-causing 483 indels and common premature stops that were variably present among the majority of elements 484 (a total of 12 of the 19 proviruses; also see Figure 7). At least three distinct frameshifts leading to 485 a stop within gag were shared over several elements (from the Fc1 CON   Mammalian genomes are littered with the remnants of retroviruses, the vast majority of which are 514 fixed among species and present as obviously defective copies. However, the genomes of several 515 species harbor some demonstrably ancient ERVs whose lineages contain relatively intact loci and 516 are sometimes polymorphic, despite millions of years since integration. Such ERVs have the 517 potential to exert expression (of either proviral-derived or host-derived products by donation of an 518 LTR) that may affect the host, especially for intact copies or those within new genomic contexts. 519 In particular, ERV expression from relatively recent integrants has been linked to disease 520 (reviewed in Coffin, 2008, Mager andStoye, 2015)). However, there is also growing 521 evidence that many fixed loci have been functionally co-opted by the host or play a role in host 522 gene regulation (reviewed in (Frank and Table S3). Though connections between LTR 561 presence and physiology are yet to be determined, these findings will inform future investigations 562 into the potential effect of CfERVs on host biology. rates to our data would infer much younger integration times of 11.85 mya to <0.91 mya and 6.1 574 mya to <0.48 mya, respectively. We note the precision in ERV-Fc1(a) age estimations using this 575 method is subject to the accuracy of the inferred background mutation rate, but may be skewed 576 due to the presence of post-insertion sequence exchange between LTRs. As the latter cannot be 577 conclusively ruled out, we interpret our estimations as broad formation times only. CfERV-Fc1(a) activity coincided with major speciation events in canine evolution ( Figure  593 8B-E). Taking into consideration the above approaches for age estimations, we refined the dating 594 of endogenization events by integrating inferred ages with that of orthologous presence/absence 595 patterns across numerous canid lineages, many of which are recently diverged clades. The 596 analysis served two purposes. First, we made use of the tenet that ERV integration is permanent 597 and the likelihood of two independent integration events at the same locus is negligible. In this 598 way, the presence of an ERV insertion that is shared between individuals or species supports its 599 origin in a common ancestor. Therefore, integration prior to or following the split of two or more  Table S1. Reference insertions corresponding to deletion variants were inferred using the program Delly 714 (v0.6.7) (Rausch et al., 2012), which processed BAM alignment files from samples indicated in 715 Table S1 using a MAD score cutoff equal to 7, and a minimum map quality score threshold of at 716 least 20. Resulting reference deletions with precise breakpoint predictions were next intersected 717 with 'CfERVF1' reference coordinates based on RepeatMasker annotations of CanFam3.1. Only 718 deletion calls corresponding to sizes of a solo LTR (400-500 bp) or a full-length provirus (7-9 kb) 719 were considered for further analysis. The chromosomal positions of candidate non-reference ERVs were first identified using the 725 program RetroSeq (Keane et al., 2013). Individual BAM files were queried using RetroSeq 726 discovery to identify ERV-supporting discordant read pairs with one read aligned to the sequences 727 corresponding to 'CfERVF1' and 'CfERVF1_LTR' from RepBase (Jurka et al., 2005). Individual 728 BAM files were merged for subsequent steps using GATK as described (Wildschutte et al., 2016). 729 RetroSeq call was run on the merged BAM files requiring ≥2 supporting read pairs for a call and 730 output calls of levels 6, 7, and 8 further assessed, resulting in 2,381 candidate insertions. Output 731 calls within ±500 bp of an annotated CfERV from the above queried classes were excluded to 732 eliminate false calls of known loci. ERV-supporting read pairs and split reads within a 200 bp 733 window of the call breakpoint were subjected to de novo assembly using the program CAP3 734 (Huang and Madan, 1999). Output contigs were filtered to identify ERV-genome junctions 735 requiring ≥30 bp of assembled LTR-derived and genomic sequence in the form of (i) one LTR-736 genome junction, (ii) linked assemblies of 5' and 3' LTR junctions, or (ii) a fully resolved LTR 737 (~457bp) with clear breakpoints that mapped to CanFam3.1. Contigs that contained putative 738 CfERV junctions were then aligned back to the reference to precisely map the insertion position 739 of each call. Assembly comparisons were visualized using the program Miropeats (Parsons, 740 1995).  Table S2). The sequences from validated and completely 800 assembled LTRs were utilized for allele reconstruction of non-reference sites. For example, the 801 validated sequences for the non-reference solo LTRs at chr2:32,863,024 (8 bp LTR extension) 802 and chr32:7,493,322 (associated with deletion of reference sequence) were included for 803 genotyping of alternate alleles. For sites with linked, but non-resolved, 5' and 3' assembled 804 junctions (i.e., missing internal sequence), we substituted the internal portion of each element 805 from the Repbase CfERVF1 consensus (see Table S2), and used the inferred sequence for allele 806 reconstruction. Insertion and pre-insertion alleles were then recreated based on ±600bp flanking 807 each insertion point relative to the CanFam3.1 reference, accounting for each 5bp TSD pair. For 808 each sample, genotype likelihoods were then assessed at each site based on re-mapping of those 809 reads to either allele, with error probabilities based on read mapping quality (Li, 2011, Wildschutte 810 et al., 2015, excluding sites without re-mapped reads for a given sample. Read pairs for which 811 both reads mapped to the internal portion of the element were excluded to avoid false positive 812 calls potentially introduced by non-specific alignment. The pipeline for genotyping is available at 813 https://github.com/KiddLab/insertion-genotype. The genotyped samples were sorted by ancestral 814 population, and allele frequencies estimated for the total number of individuals per population 815 genotyped at each locus (Table S5). Plink (Purcell et al., 2007), sites were filtered to remove those with missing genotypes in at least 821 ten percent of samples, those in LD with another SNP within 50 bp (--indep-pairwise 50 10 0.1), 822 and randomly thinned to 500,000 SNPs. To reduce the bias of relatedness, the sample set was 823 further filtered to remove duplicates within a single modern breed, leaving 254 samples (Table  824 S7). Identification of wolf samples with high dog ancestry was made through five independent 825 ADMIXTURE (Alexander et al., 2009) analyses of the thinned SNP set with random seeds 826 (558905, 110684, 501738, 37781236, and 85140928) for K values 2 through 6. Since we aimed 827 to discern cfERV-Fc1(a) insertions that may be dog-specific (i.e. having occurred since 828 domestication), we removed any gray wolf that had high dog ancestry from further analysis. To 829 do this, we calculated average dog ancestry within gray wolves at K=3 across all runs, which was 830 the K value with the lowest cross validation error rate. Wolves with greater than 10% dog ancestry 831 (an Israeli (isw01) and Spanish (spw01) wolf) were excluded from subsequent species and sub-832 population assessments. 833 834

Phylogenetic analysis 835
Nucleotide alignments were performed using MUSCLE (Edgar, 2004) followed by manual editing 836 in BioEdit (Hall, 1999) for intact CfERV-Fc1(a) LTRs from 19 proviral elements and 142 solo-837 LTRs. Of non-reference elements, the solo LTR with a 388 bp internal deletion at 838 chr22:57,677,068 was excluded, as was the 141 bp truncated solo LTR at chr5:80,814,713. We 839 also excluded partially reconstructed insertions corresponding to 'one-sided' assemblies or sites 840 with linked 5' and 3' assembled junctions but that lacked internal resolution (Table S1)          A.