Whole genome sequencing of 51 breast cancers reveals that tumors are devoid of bovine leukemia virus DNA

Controversy exists regarding the association of bovine leukemia virus (BLV) and breast cancer. PCR-based experimental evidence indicates that BLV DNA is present in breast tissue and that as many as 37% of cancer cases may be attributable to viral exposure. Since this association might have major consequences for human health, we evaluated 51 whole genomes of breast cancer samples for the presence of BLV DNA. Among 32 billion sequencing reads retrieved from the NCBI database of genotype and phenotype, none mapped on different strains of the BLV genome. Controls for sequence divergence and proviral loads further validated the approach. This unbiased analysis thus excludes a clonal insertion of BLV in breast tumor cells and strongly argues against an association between BLV and breast cancer.


Background
BLV naturally infects cattle, water buffalo, yak and zebu [1][2][3][4]. Sporadic infections with BLV have occasionally been reported in other species like alpaca [5]. Experimentally, BLV can also be transmitted to a number of species including sheep [6], goats [6], rats [7] and rabbits [8]. BLV infection causes B cell lymphocytosis, leukemia and/or lymphoma in natural and some experimental hosts [1]. There is also controversial evidence suggesting that BLV might infect humans: (1) antibodies against the BLV capsid were detected in 74% of human sera from the Berkeley Community, California [9], (2) BLV DNA was detected in breast tissues using PCR [10][11][12]. Based on a positive correlation between the rates of BLV infection and tumor frequencies (36-59% compared to 29-45% in normal tissue), as many as 37% of breast cancer cases may be attributable to BLV exposure [12].
Although these observations initiated some skepticism within the scientific community [13], the potential consequences for human health clearly require further investigation.

Results and discussion
To avoid potential experimental artifacts associated with DNA amplification techniques, we directly analyzed whole genomes of breast tumors and adjacent tissues. After retrieval of raw DNA sequences from the NCBI dbGaP [14,15], paired-reads were probed for alignment on different BLV strains using Bowtie2. As a positive control, a nuclear DNA fragment (chr12: 53,959,600-53,964,000) devoid of repeated sequences that would lead to an overestimation of aligned reads and set to 4.4 kb to fit with the monoploid 8.8 kb BLV genome was selected from the human genome. Alignment of 51 breast tumors genomes on the nuclear control sequence identified between 283 and 1287 paired-reads (illustrated on Fig. 1 and summarized on Table 1). In contrast, no homology was found with 5 different BLV subtypes (highlighted in blue on the phylogenic tree of Fig. 2a). In 19 biopsies adjacent to the breast tumors, 386-1197 paired-reads aligned onto the nuclear DNA sequence whereas none mapped on BLV (Table 1). All DNA samples contained extranuclear DNA as indicated by alignment of a control mitochondrial sequence (NC_012920) ( Table 1).
Although no paired-read corresponding to five different BLV variants could be identified, the possibility remains that extensive sequence variability impaired detection. On average, the whole genome sequencing procedure generated 660 million reads per sample. Given that the BLV provirus length is 8.8 kb and that a normal human diploid genome is 6.6 billion base pairs, the average number of reads that would be generated by a 8.8 kblong monoploid sequence is 880 (660,000,000/6600,000,0 00 × 8800). Providing that the BLV provirus is integrated in a single copy per cell, the whole genome sequencing procedure would thus generate 880 reads on average. If the strain in the sample diverges from the five reference sequences, a fraction of the reads would not be retrieved. Therefore, BLV variants were artificially generated in silico by introducing 2, 3, 6, 10 and 20% nucleotide changes in reference AF033818 (mutants 0.02, 0.03, 0.06, 0.10 and 0.20, respectively). Phylogenetic analysis of Fig. 2a illustrates that in silico generated divergence far exceeds the maximal natural sequence variations observed worldwide [16]. 880 Illumina-like reads were then simulated from these in silico variants using ART simulation tool and mapped on BLV genome AF033818. Most reads (818 of 880) generated from mutant 0.02 aligned on reference sequence AF033818 (Fig. 2b). Even the highly divergent mutant 0.10 still aligned 41% of its 880 reads on the reference. Up to 20% divergence in mutant 0.20 was required to significantly impair detection, although BLV specific reads were still identified (Fig. 2b).
Whole genome analysis thus excludes clonal integration of natural and highly divergent BLV strains in breast tumors. Since only a small proportion of cells may carry the provirus, the sensitivity of the analysis was correlated to the proviral loads. Any natural BLV variant that would infect 10% of the tumor cells is expected to generate about 100 reads (Fig. 2c, dotted blue line). The number of expected reads decreases along with the percentage of infected cells to reach approximately one read with a proviral load of 0.1% (Fig. 2c, dotted blue line). Considering a 59% prevalence of breast tumors positive for BLV [12], 30 samples out of our 51 should be positive. Even with an individual proviral load around 0.1%, this should make about 30 reads (on average one per patient) mapping on BLV, whereas none were found.
Using whole genome analysis, we concluded that there is no evidence for a single BLV-specific or even related sequence. The discrepancies and limitations of this report and others pertain to: 1. The origin of the samples It is indeed possible that tumor biopsies from previous studies originating from US [11,12] and Colombia [10] significantly differ from those reported in the dbGaP NCBI database. Even if we restrict our observations on US originating samples (n = 35), the discrepancy remains highly significant. Indeed, Buehring reported 67 breast tumors positive for BLV over 114 cases [12] whereas we found none over 35 cases (the p value for fisher test is 1.12 × 10 −6 ). 2. The DNA extraction technique In situ PCR suggested that BLV proviral DNA is localized in the cytoplasm [11,12]. Analysis of mitochondria-specific sequences       (Table 1) shows that dbGaP NCBI database includes reads corresponding to 16 kb-long, circular and extranuclear mitochondrial DNA. 3. The strain divergence Artificial in silico simulation of highly divergent mutants still identified BLV specific reads (Fig. 2b). Since nucleotide substitutions among BLV strains worldwide are limited to 2.3% [16], it remains questionable whether these mutants still belong to the same species. Further analysis show that breast tumor genomes do not map on HTLV-1 sequences (data not shown). Why BLV-conserved sequences were previously identified by PCR remains an enigma. 4. Viral expression Although BLV is expressed at trace levels in the bovine species, the p24 viral capsid protein was detected in 5% of breast tumors [12]. This observation is inconsistent with RNASeq analysis of 154.7 billion of transcriptome sequencing reads from The Cancer Genome Atlas Research Network [17,18].
Our present study based on whole genome analysis excludes a clonal insertion of BLV in tumor cells and does not support converging lines of evidence which previously suggested an association between BLV infection and breast cancer.

Methods
Raw DNA sequences from whole genomes of breast tumors and normal breast tissues adjacent the tumor were retrieved from the NCBI database of genotype and phenotype (dbGaP). These sequences were extracted from two studies: (1) estrogen receptor positive breast cancer: aromatase inhibitor response study (accession number phs000472) [14] and (2) sequence analysis of mutations and translocations across breast cancer subtypes (accession number phs000369) [15]. Archive files were downloaded with prefetch v2.5.7 and sequencing reads were extracted with fastdump v2.5.7 using "split-3" option to separate paired reads and single reads (NCBI SRA Toolkit). Paired reads were probed for alignment on different BLV variants (accession numbers: AF033818, AF275515, D00647, K02120, LC080667) and, as positive control, on human genomic sequences using Bow-tie2 (version 2.2.5). We used the "very-sensitive" option of Bowtie2 to maximize the likelihood of viral detection. Fig. 2 Analysis of sequence variation and proviral load in sequence alignments. a Neighbour-joining phylogenetic tree of BLV and HTLV-1 genomes. b Using the ART simulation tool (NIH), Illumina-like 100 bp paired-reads were generated in silico from the mutants. 880 simulated reads were probed for alignment on BLV AF033818 using Bowtie2 and visualized using IGV. c Correlation between proviral loads and predicted number of reads