In this study, we evaluated the ability of 454 sequencing of PCR products to accurately portray HIV sequence populations. Using mixtures of cloned DNA containing wild type (WT) or mutant sequences at 13 sites associated with resistance to RT inhibitors, we investigated the frequency and mechanisms of point errors, indels, and PCR-introduced recombination, as well as the sensitivity for detecting drug resistance mutations, in three independent runs. We looked initially at recombination. We defined a recombinant sequence as one containing both WT and mutant residues generated from mixtures of the two clones. This method is limited by a small background, because it cannot determine whether a single nucleotide change resulted from a point error or from a (usually double) recombination event in the intervals between drug resistance sites. Furthermore, we were not able to observe recombination between identical parental sequences. To maximize detectable crossover events, we used 50% WT/mutant mixtures. As our results show (Table 7), the measured WT/mutant ratios were not exactly 50%. This likely reduced the observed recombination. To differentiate whether a mutation in a WT molecule (or a wild type nucleotide in a mutant molecule) arose from a point mutation error or from a crossover event, we sequenced clones of 100% WT and 100% mutant samples as controls. Indeed, in Run1 and Run2, we observed “recombinants” from pure samples at a frequency of 0.11% to 0.73%. Sequences from these samples were unlikely to have been recombinants and were probably the result of point errors. For our analyses, these small values were subtracted from observed results with mixtures to obtain corrected recombination frequencies (Table
1). Our results show that standard PCR used for sample preparation produced a remarkably high frequency of artifactual recombination. Generally, the frequency of crossovers depended on the length of the intervals between sites; however, base composition appeared to affect the crossover rate as well. Tsibris et al. reported that in vitro recombination was infrequent (0.11% to 0.15%) in a mixture of 3 different clones with the minor species at 1%. We observed 0.9% to 1.5% recombinants in our experiment with 1% mutant clone in a WT background (Table
1), meaning that more than half of the mutant sequences were involved in crossovers. The different recombination rates in the two studies may be partially explained by different experimental designs (3 clone mixture vs. 2 clone mixture), different parental sequences (HIV-1 env V3 loop vs. HIV-1 pol), or different PCR conditions. Hedskog et al. and Mild et al. reported 0.89% in vitro recombination in their studies, which is lower than the frequency we obtained. One reason could be that the fragment they used to detect crossovers, with 14 signature nucleotides (RT amino acid positions 181, 184, 188, 190, 210, 215, and 219), is shorter than ours (positions 41, 65, 67, 70, 74, 100, 103, 181, 184, 188, 190, 215, and 219): we also have signature nucleotides from positions 41 to 103. The differences in the intervals and in the sequence composition from position 41 to 181 partially explain the different recombination rates observed. Additionally, Gorzer et al. showed that artifactual recombination correlated significantly with the initial amount of DNA used for PCR amplification. Mild et al. used 100,000 templates in PCR and observed 0.89% recombinants, while we used 1,000,000 templates in our PCR and observed 11.65% recombinants. PCR-mediated recombination results from incompletely extended primers annealing to heterologous templates and extending in the next round of elongation
[14, 16]. By modifying PCR conditions to reduce the probability of premature termination, we found that PCR-mediated recombination could be reduced 27-fold.
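The recombinant definition and background correction described above can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this study. It assumes that each read has already been reduced to a string of per-site calls, "W" for a wild type residue and "M" for a mutant residue at each signature site, and all function names are hypothetical.

```python
from collections import Counter


def classify_read(calls):
    """Classify one read from its per-site calls ('W' or 'M', in site order)."""
    if not calls or set(calls) - {"W", "M"}:
        raise ValueError("calls must be a non-empty string of 'W' and 'M'")
    kinds = set(calls)
    if kinds == {"W"}:
        return "wt"
    if kinds == {"M"}:
        return "mutant"
    # Both WT and mutant residues are present on the same read.
    return "recombinant"


def count_crossovers(calls):
    """Minimum number of crossovers needed to explain the call pattern."""
    return sum(1 for a, b in zip(calls, calls[1:]) if a != b)


def recombinant_fraction(reads):
    """Fraction of reads carrying both WT and mutant residues."""
    if not reads:
        raise ValueError("no reads supplied")
    counts = Counter(classify_read(r) for r in reads)
    return counts["recombinant"] / len(reads)


def corrected_recombination(mixture_reads, control_reads):
    """Subtract the apparent recombinant rate of a pure (100%) control.

    Apparent recombinants in a pure control are attributed to point
    errors, so their rate is treated as background (an assumption that
    mirrors the subtraction described in the text).
    """
    raw = recombinant_fraction(mixture_reads)
    background = recombinant_fraction(control_reads)
    return max(raw - background, 0.0)
```

In this sketch a single isolated change such as "WWMWW" counts as two crossovers, which is the ambiguity between a point error and a double recombination event noted above.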
We next examined point and indel errors known to arise during PCR and ultradeep sequencing. It has been previously reported that the error rates differ at homopolymer regions and non-homopolymer regions
[10, 13, 18]. We found that the point errors were evenly distributed except at homopolymer regions, particularly near the 3′ end of the sequenced region. This discrepancy is even more dramatic in indel error distributions. A high frequency of indel errors was found primarily in homopolymer regions (Figures
4D). Additionally, we found that, overall, in our study of the HIV-1 RT region, there were more deletions than insertions in our samples. This differs from the observation by Vandenbroucke et al., who reported 0.07% to 0.14% insertions and 0.02% to 0.08% deletions in their study of the HIV-1 env V3 region. This difference may be related to sequence context. Manual examination of the sequence alignment confirmed that more deletions were produced in homopolymeric A regions, particularly the region near RT K103. We also noticed that the deletion/insertion rate differed between Run1 and Run2, while point mutation error rates were very similar. This difference may reflect variation in the performance of 454 sequencing from one run to the next. We also found that transversion errors were 5–10 fold lower than transition errors (Table
5). Huse et al. reported that A to G and T to C changes were more frequent than other types of changes. Our results show that the frequencies of transitions exhibited a small bias in the same direction, but that all transitions were nonetheless more frequent than transversions.
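The two error categories discussed above (transitions versus transversions, and homopolymer versus non-homopolymer positions) can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, and the minimum run length of 3 used to call a homopolymer is an assumption, not a threshold taken from this study.

```python
import re

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("ref and alt must be two different bases from ACGT")
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def homopolymer_runs(seq, min_len=3):
    """List (start, end, base) for runs of one base at least min_len long.

    Coordinates are 0-based with an exclusive end. min_len=3 is an
    assumed threshold chosen for illustration.
    """
    pattern = r"([ACGT])\1{%d,}" % (min_len - 1)
    return [(m.start(), m.end(), m.group(1))
            for m in re.finditer(pattern, seq.upper())]
```

Tallying observed errors by `substitution_type` and by whether their position falls inside a run returned by `homopolymer_runs` would give the kind of breakdown summarized in this paragraph.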
Ultradeep sequencing has been used to identify low frequency drug resistance mutations
[3, 6, 11]. Mitsuya et al. proposed that it was unlikely that variants at a frequency > 1.0% resulted from sequencing errors. They used 1% as the cutoff for drug resistance sites and 2% for other RT sites. A similar result was obtained by Gilles et al. Based on 100% wild type or 100% mutant samples, we show that it is possible to use a substantially lower cutoff for some drug resistance sites because error rates were considerably lower at some sites than at others. For example, the backgrounds of K103N (A to C) and K103N (A to T) were each 0.02%, and that of L74V (T to G) was 0.02%, with 95% confidence bounds of 0.01 to 0.03 and 0.00 to 0.03, respectively (Table
6). We measured the fractions of mutations in samples with mixed wild type and mutant sequences (Table
7) and found that frequencies at each site were in good agreement with expected values down to about 1%. It seems clear from these results, however, that it is possible to use this technology to measure the frequency of specific mutations, particularly transversions, down to less than 0.1%, similar to that achievable with allele-specific PCR. In such cases, however, it is essential (and not particularly difficult) to include internal controls of cloned DNA (or transcripts prepared from cloned DNA) to assess the actual background frequencies achieved in any experiment.
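One way to turn an internal control into a site-specific cutoff is sketched below in Python. This is an assumption-laden illustration, not the method used to compute the confidence bounds in Table 6: it uses the Wilson score interval for a binomial proportion, and the function names and the decision rule (call a variant only if its frequency exceeds the upper bound of the control background) are hypothetical.

```python
import math


def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0 or not 0 <= errors <= total:
        raise ValueError("need total > 0 and 0 <= errors <= total")
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total
                         + z * z / (4 * total * total)) / denom
    return max(centre - half, 0.0), min(centre + half, 1.0)


def exceeds_background(variant_count, depth, control_errors, control_depth):
    """True if the sample frequency is above the control's upper bound.

    control_errors / control_depth are the counts of the same
    substitution at the same site in a pure cloned control.
    """
    _, upper = wilson_interval(control_errors, control_depth)
    return variant_count / depth > upper
```

For example, with 2 background errors in 10,000 control reads, a variant seen in 10 of 10,000 sample reads would pass this rule, while one seen in 3 of 10,000 would not.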
Point errors in 454-generated sequences arose from both the PCR and the sequencing steps. Although errors resulting from sequencing have been reported to result in part from more than one molecule being bound to a single bead before the emulsion PCR, this artifact cannot have caused the errors observed in sequencing cloned DNA. In any case, our data show that PCR contributed the majority of the point error rate (0.12 ± 0.16% for PCR-amplified vs. 0.02 ± 0.06% for cloned DNA, Table 4) and that sequencing contributed primarily to indel errors. This conclusion was also suggested by Vandenbroucke et al.
We observed 0.01% cross contamination in our studies (Table 1: Run1 MID2 (100% wt), MID3 (100% mutant), and Run2 MID2 (100% wt, cloned without PCR)). This effect could have resulted from laboratory error, but could also be due to cross contamination in primer synthesis resulting in mislabeling of a fraction of a sample with an incorrect MID. We have also shown that a high frequency of recombination could be introduced by standard PCR conditions. However, by using the low recombination PCR conditions described here, 454 sequencing technology can be a useful tool in studying mutation linkage and haplotype composition. Our results have also shown that, while indel errors were more frequently found in homopolymer regions and occurred mainly during sequencing, point errors were more or less evenly distributed across the whole region and occurred mainly during PCR. We found that drug resistance sites had lower point error rates compared to other sites, implying that it is possible to detect rare drug resistance mutations with high sensitivity.
In this study, we observed higher than expected mutant/wt ratios (Table 7). The differences between the expected and observed ratios could be due to the way a sequence read was defined as mutant or wild type: each read was compared by BLAST to the wild type reference and to the mutant reference. If it aligned better with the wild type reference (with higher E score), then it was defined as wild type. For the purpose of Table
7, we did not separate recombinants as we did for Table
1; all sequences were assigned either to wild type or mutant. Figure
2 shows the recombination patterns in Run3MID12. The numbers of sequences in each crossover product pair were not exactly the same: there were more sequences in which gray regions (mutant) predominated than sequences in which white regions (wild type) predominated. Therefore, more putative recombinants in Table 7 were defined as mutants. Ratios of the mixtures were verified by ASP prior to deep sequencing, so the higher than expected proportion of mutant sequences is likely due to PCR or sequencing bias.
Recently, Jabara et al. reported an experimental system in which a randomly synthesized 8-base segment (“primer ID”) was incorporated into the primer for cDNA synthesis. Consensus sequences were built from the products of PCR amplification and used for mutation detection. Consensus sequence construction can remove minor sequencing errors and recombinants produced by PCR.
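The idea of consensus building from primer ID groups can be sketched in Python as follows. This is a simplified illustration and not the procedure of Jabara et al.: it assumes that reads sharing a primer ID have already been aligned to equal length, takes a simple per-position majority vote, and uses a hypothetical `min_reads` threshold of 3 to discard primer IDs with too few reads.

```python
from collections import Counter, defaultdict


def consensus_by_primer_id(reads, min_reads=3):
    """Build one majority-vote consensus sequence per primer ID.

    reads: iterable of (primer_id, sequence) pairs. Sequences within a
    primer ID group are assumed to be pre-aligned and of equal length.
    Groups with fewer than min_reads reads are skipped (assumed cutoff).
    """
    groups = defaultdict(list)
    for primer_id, seq in reads:
        groups[primer_id].append(seq)

    consensus = {}
    for primer_id, seqs in groups.items():
        if len(seqs) < min_reads:
            continue
        if len({len(s) for s in seqs}) != 1:
            raise ValueError("reads in a group must have equal length")
        # Majority base per column; ties go to the base seen first.
        consensus[primer_id] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus
```

Because each primer ID marks a single starting template, an error or a recombinant that appears in only a minority of the reads in a group is outvoted in the consensus.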