The Women's Interagency HIV Study (WIHS) is a multicenter, prospective cohort study to investigate the impact of HIV-1 infection on women. In 1994, 2,628 women (2,059 HIV-1 positive and 569 HIV-1 negative) were recruited by both institution- and community-based programs. Every six months the participants met with study personnel for an encounter termed a "visit", during which they were interviewed using a structured questionnaire and received a physical examination. Informed consent was obtained from all study participants at the individual WIHS sites, and the human experimentation guidelines of the individual sites and of the Johns Hopkins Bloomberg School of Public Health were followed in the conduct of this research.
Fifty-eight HIV-1 infected individuals contributing 123 study visits were selected for analysis. All samples were from visits that occurred between initiation of the WIHS and 2000. All participants met the following criteria: 1) a defined injection drug use (IDU) status; 2) a visit within 12 months prior to initiating highly active antiretroviral therapy (HAART); 3) a viral load >4,000 copies/ml of plasma to avoid re-sampling the same virion; and 4) a CD4 T cell count <200 on the last pre-HAART visit as an indication of disease progression. Nineteen IDUs met these criteria (33% of the 58 participants), and 39 non-IDUs (67%) were randomly selected from those who met the criteria for further analysis.
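The four inclusion criteria can be expressed as a simple filter. The following is a minimal sketch only: the `Candidate` record and its field names are hypothetical, and treating "within 12 months prior" as the closed interval from 0 to 12 months is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical record for one participant; field names are illustrative."""
    idu_status: Optional[bool]     # None means IDU status was not defined
    months_before_haart: float     # time from the sampled visit to HAART initiation
    viral_load: float              # HIV-1 RNA copies/ml of plasma
    last_pre_haart_cd4: float      # CD4 T cell count on the last pre-HAART visit


def is_eligible(c: Candidate) -> bool:
    """Apply the four inclusion criteria listed in the text."""
    return (
        c.idu_status is not None                # 1) defined IDU status
        and 0 <= c.months_before_haart <= 12    # 2) visit within 12 months before HAART
        and c.viral_load > 4000                 # 3) viral load > 4,000 copies/ml
        and c.last_pre_haart_cd4 < 200          # 4) CD4 count < 200
    )
```

Random selection of the non-IDU comparison group would then be a separate step applied to the eligible non-IDU candidates.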
The median age of the 58 WIHS participants at baseline was 38 years, the overall median log10 (HIV-1 RNA) level was 4.80 cps/ml and the overall median CD4+ cell count was 311 cells/mm3. The majority (64%; n = 37) of study participants were African-American. Compared to the non-IDUs, IDUs had higher median log10 HIV-1 RNA levels (4.97 (4.40, 5.34) vs. 4.66 (4.15, 5.26)) and lower median CD4+ cell counts (200 (85, 479) vs. 359 (133, 572)), but the differences were not statistically significant. Racial composition did not differ between the IDU and non-IDU groups. Study participants reporting a history of IDU were older than those not reporting IDU prior to enrollment (40 vs. 35; P = 0.03). All participants reporting IDU were HCV positive at baseline, while only 4 (11%) of the non-IDUs were HCV positive (P < 0.01). Although treatment was initiated from different sites within the multi-centered WIHS cohort, treatment was generally based on the standard of therapy at the time of each subject's study visit. Among non-IDUs, 20 (51%) participants reported using monotherapy or combination therapy prior to study entry, compared to 12 (63%) participants with a history of IDU (P = 0.72). All monotherapy and combination therapy reported prior to study enrollment consisted of only nucleoside and/or non-nucleoside reverse transcriptase inhibitors.
A total of 1,100 cloned sequences of the pol gene and 1,100 of the env gene of HIV-1 were obtained as described in an earlier study. Additional sequences were obtained for this study to fill in some sampling gaps, for a total of 1,127 pol and 1,236 env sequences. Our goal was to sample 10 sequences for each gene from each visit (1,230 total sequences for each gene given 123 visits), but occasionally that goal was not met, with the smallest sample size per gene per visit being four. HIV-1 RNA was isolated from stored samples of plasma using the QIAamp viral RNA mini-kit (QIAGEN, Valencia, California, USA). The isolated RNA was subjected to RT-PCR (Life Technologies Superscript One-Step RT-PCR for long templates). To avoid contamination among subject visits, all plasma samples from a subject visit were processed for reverse transcription and amplification singly (one at a time) in a PCR clean room within the laboratory in which no amplified specimens were permitted. After sequencing, all sequences from the study population were aligned and placed on a single phylogenetic tree to ensure that there were no closely related sequences appearing among different individuals. In eighteen instances (out of the 2,364 total sequences) an env or pol sequence was indeed phylogenetically located within a monophyletic cluster defined by the sequences from a different subject. All eighteen sequences were regarded as potential contaminants and excluded from all subsequent analyses.
For the pol gene, we used the primers pro-1 (TTGGAAATGTGGAAAGGAAGGAC) and RT-0 (CATATTGTGAGTCTGTTACTATGTTTAC) with a program of 50°C for 30 minutes, 94°C for 2 minutes, and 35 cycles of 94°C for 40 seconds, 50°C for 40 seconds, and 68°C for 3 minutes, followed by one cycle of 72°C for 10 minutes and then a hold at 4°C. A second round PCR was run using the GeneAmp XL PCR kit (Roche Applied Biosystems, Indianapolis, IN), with the primers pro-3 (GAGCCAACAGCCCCACC) and RT-3 (GCTGCCCCATCTACATAGAA), with an amplification protocol of 94°C for 1 minute, followed by 35 cycles of 94°C for 40 seconds, 52°C–56°C for 40 seconds, and 68°C for 2 minutes 30 seconds, followed by one cycle of 72°C for 10 minutes, with the product held at 4°C until it was harvested and run on an 8% agarose gel. A band at the 1,617 base-pair size was extracted from the gel using the QIAquick Gel Extraction Kit (Qiagen, Valencia, California, USA), and the obtained DNA was ligated into the TOPO 2.1 vector and transformed into TOPO 10 competent cells (Qiagen, Valencia, California, USA), according to the manufacturer's instructions. The transformed cells were plated on LB agar plates containing 50 μg/ml ampicillin and 40 μl of 40 mg/ml X-gal. Confirmed transformants were grown overnight and plasmid DNA was extracted for sequencing, using an ABI Prism 3700 DNA Analyzer (Perkin Elmer Biosystems, Boston, Massachusetts, USA). The cloned sequences were obtained in nucleotide format and translated into amino acids using MegAlign software by DNAStar (DNASTAR Inc., Madison, WI). The entire protease (PR) region (297 nucleotides) and partial reverse transcriptase (pRT) region (674 nucleotides, including all known sites of resistance mutations) were available from each of the 123 study visits. The pol sequences generated are available through GenBank, Accession Numbers EF374379–EF375478. Note that these sequences were aligned for each individual subject, but were not aligned across individuals.
Phylogenetic analysis requires aligned sequences, both within and across individuals, and a file containing the alignment for all pol sequences is available upon request from ART.
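The translation of cloned nucleotide sequences into amino acids was done with MegAlign in this study. As an illustration of that step only, the following minimal sketch translates a nucleotide string with the standard genetic code. The function name `translate` and the choice to emit `X` for codons containing ambiguous bases or gap characters are assumptions, not part of the original workflow.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str) -> str:
    """Translate in frame 1; unknown codons become 'X', stop codons '*'."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3   # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

Under this sketch, the 297-nucleotide protease region corresponds to 99 codons.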
The same technique was used for sequencing the C2–V5 regions of the env gene. The first round primers were ED12C (AGTGCTTCCTGCTGCTCCCA) and ED31C (CCATTACACAGGCCTGTCCAAAG) and the second round primers were DR7C (TCAACTCAACTGGTCCAAAG) and DR8C (CACTTCTCCAATTGTCCCTCA), which yield data on 694 nucleotides in the aligned sequences. The env sequences generated are available through GenBank, Accession Numbers EU040366–EU041600. Note that these sequences were aligned for each individual subject, but were not aligned across individuals. Phylogenetic analysis requires aligned sequences, both within and across individuals, and a file containing the alignment for all env sequences is available upon request from ART. Because the sequences are very similar within the monophyletic clusters, our principal concern was the alignment across clusters. To check the quality of this alignment, representative sequences were chosen from the monophyletic clusters and assessed for alignment quality using the program ClustalX. For pol, the low quality sites were highly scattered, indicating an overall excellent alignment with no problematic blocks. For env, there were two clusters of low quality alignment, one of 29 nucleotides in length and a second of 18 nucleotides in length. Both regions were characterized by many inferred insertions or deletions. The inclusion or exclusion of these nucleotide sites had no impact on the topology of the neighbor-joining tree relative to the inferred monophyletic clusters, the only purpose for which this tree was used. The env and pol neighbor-joining trees are available in additional files 1 and 2.
Inference criteria for multiple infection, coinfection and superinfection
All the pol sequence data from all participants and all visits were used to construct a neighbor-joining tree for the pol gene using PAUP*, and likewise all the env sequence data from all participants and all visits were used to construct a neighbor-joining tree for the env gene. The program ModelTest was used to fit the nucleotide data to a substitution model, and for both env and pol, the best fitting model using the Akaike criterion was TVM+I+G (a transversional model with unequal base frequencies, some invariant sites, and rate variation among sites). Our only use of these neighbor-joining trees was to test for monophyletic clusters. As will be described, all the monophyletic clusters in these data were separated by multiple mutations (a minimum of 31) that yield extremely long branch lengths in the neighbor-joining trees and would be easily detected by any clustering technique. As will also be described, we did not use neighbor-joining to infer the evolutionary trees within a monophyletic cluster but rather used the Bayesian procedure of statistical parsimony.
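Model choice by the Akaike criterion amounts to picking the model with the smallest AIC = 2k − 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The sketch below illustrates only that comparison; the function names and the dictionary layout are hypothetical, and it is not a reimplementation of ModelTest.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model_by_aic(fits: dict) -> str:
    """fits maps a model name to (maximized log-likelihood, free parameters)."""
    return min(fits, key=lambda name: aic(*fits[name]))
```

A richer model is preferred only when its gain in log-likelihood exceeds the number of parameters it adds, which is how a parameter-rich model such as TVM+I+G can still be selected.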
An individual subject was regarded as having only a single source infection if both the pol and env sequences defined a single monophyletic cluster in the respective multi-subject neighbor-joining trees. Additional analyses were performed if one or both genes from a specific subject defined two or more disjoint clusters (polyphyly) within the multi-subject neighbor-joining tree(s). When polyphyly was detected, a tree was constructed that forced all the sequences from a single subject to be monophyletic, and the Templeton test option [29, 30] in PAUP* was used to test the null hypothesis that the polyphyletic tree was not significantly different from the monophyletic tree. When sequences are forced to be monophyletic, long branches are created in the trees to explain the enforced monophyly. Homoplasy (multiple mutational hits at the same nucleotide site that cause reversals and/or parallelisms) is very common in HIV data, and parsimony preferentially underestimates the length of long branches when homoplasy is common. Because the Templeton test acquires greater statistical power as the estimated branch length increases, the high levels of homoplasy typical of HIV data sets mean that the Templeton test will be a statistically conservative test of monophyly.
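The Templeton test compares two tree topologies with a Wilcoxon signed-rank test on the per-site differences in the number of mutational steps each tree requires. The following is a minimal sketch of that idea, assuming the per-site step counts have already been computed for both trees. It uses a normal approximation without a tie or continuity correction, which is a simplification and may differ from the PAUP* implementation; the function name is hypothetical.

```python
import math


def templeton_test(steps_tree_a, steps_tree_b):
    """Wilcoxon signed-rank test on per-site step differences.

    Returns (T, two-sided p) using a normal approximation.
    """
    # Sites requiring the same number of steps on both trees are uninformative.
    diffs = [a - b for a, b in zip(steps_tree_a, steps_tree_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    t = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (t - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return t, p
```

Here `steps_tree_a` would hold the step counts on the tree with enforced monophyly and `steps_tree_b` those on the unconstrained polyphyletic tree; a small p-value rejects the hypothesis that the two trees explain the data equally well.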
As discussed previously, 18 sequences were regarded as possible contaminants and excluded from this analysis of polyphyly. Multiple infection was inferred only when two or more distinct polyphyletic clades (branches) existed within an individual such that at least two clades contained two or more haplotypes for one or both genes.
Multiple infections detected on the first visit were regarded as potential coinfections, and all other cases of multiple infection were regarded as superinfections. As all of the participants were already HIV positive at baseline, it is possible that some of the potentially coinfected cases were actually superinfections. Hence, our estimate of coinfection may be biased upwards and our estimate of superinfection may be biased downwards. This also means that all tests of heterogeneity between coinfected and superinfected individuals will be biased in favor of the null hypothesis of homogeneity.
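The decision rules above can be summarized in a short sketch. It assumes that the clades for each subject have already been identified and that only the number of haplotypes per clade is needed; the function names, the dictionary layout, and the `detected_at_first_visit` flag are hypothetical.

```python
def is_multiple_infection(haplotypes_per_clade_by_gene: dict) -> bool:
    """True if, for at least one gene, two or more distinct clades
    each contain two or more haplotypes."""
    return any(
        sum(1 for n in clade_sizes if n >= 2) >= 2
        for clade_sizes in haplotypes_per_clade_by_gene.values()
    )


def classify_infection(haplotypes_per_clade_by_gene: dict,
                       detected_at_first_visit: bool) -> str:
    """Label a subject as single infection, potential coinfection,
    or superinfection following the criteria in the text."""
    if not is_multiple_infection(haplotypes_per_clade_by_gene):
        return "single infection"
    if detected_at_first_visit:
        return "potential coinfection"
    return "superinfection"
```

A subject with one large clade and a second clade holding a single haplotype is therefore not counted as multiply infected under this rule.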
Recombination between the pol and env genes in multiple-infected individuals was inferred when only one of these genes resulted in polyphyly. Recombination within the pol sequences and within the env sequences was inferred by the method of Crandall and Templeton as modified by Templeton et al. This method was specifically developed for detecting recombination in HIV. Separate evolutionary trees for the pol and env sequences of all the haplotypes (unique sequences) found in a single individual over all visits were estimated using statistical parsimony with the program TCS. The haplotype tree represents the null hypothesis of no recombination. Individual mutational transitions that appear on multiple branches (homoplasies) in the tree may be the result of recurrent mutation or recombination. Recombination as a cause of homoplasy can be distinguished from recurrent mutation because homoplasies caused by recombination are physically clustered in the sequence. This results in spatially contiguous runs of homoplasies in the tree. A runs test [implemented in a Mathematica program available by request from ART] is used to test the null hypothesis of no association between homoplasies and physical location in the DNA or RNA region. Recombination is only inferred when the runs test is statistically significant at the 5% level or less. This procedure identifies both the putative recombinant and its parents and localizes the interval in which recombination occurred. This test is particularly appropriate for HIV sequence data, which is strongly affected by mutational homoplasy and selection. The runs test is conditioned upon the topology of the tree and depends only upon the clustering of homoplasies on a single branch that are also physically clustered in the nucleotide sequence.
The selection that has been documented in HIV sequence data is not associated with such close physical clustering, and most tests of selection are sensitive to frequencies of SNPs or haplotypes, which do not enter into this statistic at all. Moreover, high levels of homoplasy often cause loops in the statistical parsimony tree, which represent phylogenetic ambiguities. However, when tracing runs through such loops, the resulting set of runs is invariant to how the loop is traversed and depends only upon the nucleotide differences between the sequences at the end-points of the run.
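The logic of a runs test for physical clustering can be illustrated with a generic exact runs test. This sketch is not the authors' Mathematica program and does not reproduce the Crandall and Templeton procedure. It assumes the variable sites along one branch are listed in physical order and flagged as homoplasious or not, and it computes the probability of seeing as few or fewer runs under random arrangement. All function names are hypothetical.

```python
from math import comb


def count_runs(labels) -> int:
    """Number of maximal blocks of identical consecutive labels."""
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if bool(a) != bool(b))


def arrangements_with_runs(r: int, n1: int, n2: int) -> int:
    """Number of orderings of n1 flagged and n2 unflagged sites with exactly r runs."""
    k, odd = divmod(r, 2)
    if not odd:
        return 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return (comb(n1 - 1, k) * comb(n2 - 1, k - 1)
            + comb(n1 - 1, k - 1) * comb(n2 - 1, k))


def runs_test_lower_tail(labels) -> float:
    """Exact P(number of runs <= observed) when all orderings are equally likely.

    Few runs means the flagged sites are physically clustered.
    """
    n1 = sum(1 for x in labels if x)
    n2 = len(labels) - n1
    if n1 == 0 or n2 == 0:
        return 1.0   # only one label present, so clustering cannot be assessed
    observed = count_runs(labels)
    ways = sum(arrangements_with_runs(r, n1, n2) for r in range(2, observed + 1))
    return ways / comb(n1 + n2, n1)
```

For example, four homoplasies falling contiguously among twelve ordered sites form three runs, and the lower-tail probability is 12/495, below the 5% level used in the text.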
RT-PCR can also induce recombination during sequence amplification. To focus only on recombination events that occurred naturally within an infected subject, we excluded all those recombination events that were identified by only a single recombinant sequence, which always had to be located at the terminus of a branch in the evolutionary tree of haplotypes. We regarded as true recombination only those events from which a monophyletic branch (clade) evolved that contained two or more sequences in the evolutionary tree of haplotypes.
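This exclusion rule is a simple filter on the inferred events. In the sketch below, each event is a dictionary with a hypothetical `clade_size` key giving the number of sequences in the clade descending from the recombinant; the representation is an assumption made for illustration.

```python
def natural_recombination_events(events, min_clade_size: int = 2):
    """Keep only events whose recombinant founded a clade of at least
    min_clade_size sequences; singletons may be RT-PCR artifacts."""
    return [e for e in events if e["clade_size"] >= min_clade_size]
```

The filter is conservative: a genuine in vivo recombinant sampled only once is discarded along with amplification artifacts.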
The null hypothesis of no association between two binary categorical variables was tested with a Fisher's Exact Test, as implemented in the program StatXact 7.0 (Cytel Software Corporation). Homogeneity of recombination rates over various classifications was also tested with an exact test with StatXact 7.0. An exact logistic regression was performed with the program LogXact 7.0 (Cytel Software Corporation) to investigate the impact of IDU status, multiple infection status, and gene upon recombination.
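For readers without access to StatXact, the two-sided Fisher's Exact Test for a 2×2 table can be sketched with the hypergeometric distribution. This is a minimal illustration, not the StatXact algorithm; it uses the common convention of summing the probabilities of all tables no more likely than the observed one, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
```

A call such as `fisher_exact_two_sided(12, 7, 20, 19)` would correspond to the prior-therapy counts reported for IDUs and non-IDUs, although no claim is made here that it reproduces the published P value.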
Differences in proportions were tested with an arcsine square-root transformation corrected for small sample size, as implemented in a Mathematica program available by request from ART. Comparisons between various groups of participants for viral load and CD4+ cell counts were executed in Excel (Microsoft) using a two-tailed t-test without assuming equal variances.
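Both tests can be sketched briefly. The arcsine square-root test below uses the approximate variance 1/(4n) for the transformed proportion and omits the small-sample correction, because the text does not specify which correction was used. The second function computes the unequal-variance (Welch) t statistic and its approximate degrees of freedom, leaving the p-value lookup to a t distribution table or a statistics package. Function names are hypothetical.

```python
import math
from statistics import mean, variance


def arcsine_proportion_test(x1: int, n1: int, x2: int, n2: int):
    """Compare two proportions after an arcsine square-root transformation.

    Returns (z, two-sided p). No small-sample correction is applied.
    """
    phi1 = math.asin(math.sqrt(x1 / n1))
    phi2 = math.asin(math.sqrt(x2 / n2))
    standard_error = math.sqrt(1 / (4 * n1) + 1 / (4 * n2))
    z = (phi1 - phi2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def welch_t(sample_a, sample_b):
    """Unequal-variance t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df
```

The transformation stabilizes the variance of a proportion, so the standard error depends only on the sample sizes and not on the proportions themselves.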
Because our sample design is more complete than that of most previous surveys for multiple infections, we also analyze subsamples of our data in order to compare our results to previously published results. In some cases our subsample is based on a stratifying variable, such as a subsample based upon having only pol sequence data. In such cases, we simply estimate the rate of multiple infection from our data using only the information gained from the pol data stratum and ignoring the env sequence data stratum. In other cases, we form a subsample at random. For example, to simulate what we would have found if we only had cross-sectional data, we calculate the rate of multiple infection that we would have observed by using the data from only one randomly chosen visit per subject. Other subsamples reflect a mixture of these stratifying and random subsamples; e.g., a subsample that simulates a cross-sectional study done only with pol.
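The subsampling schemes described above can be sketched as follows. The data layout is a simplifying assumption: each subject maps to a list of visits, and each visit is a dictionary of per-gene flags stating whether multiple infection is detectable from that gene at that visit. In the actual analysis, detection depends on phylogenetic clustering rather than precomputed flags. The function and parameter names are hypothetical.

```python
import random


def multiple_infection_rate(visits_by_subject: dict,
                            genes=("pol", "env"),
                            one_visit_per_subject: bool = False,
                            rng=None) -> float:
    """Fraction of subjects with detectable multiple infection.

    genes restricts the analysis to a stratum (for example pol only);
    one_visit_per_subject simulates a cross-sectional design by drawing
    a single visit at random for each subject.
    """
    rng = rng or random.Random()
    detected = 0
    for visits in visits_by_subject.values():
        chosen = [rng.choice(visits)] if one_visit_per_subject else visits
        if any(visit[gene] for visit in chosen for gene in genes):
            detected += 1
    return detected / len(visits_by_subject)
```

Calling the function with `genes=("pol",)` and `one_visit_per_subject=True` corresponds to the mixed case of a cross-sectional study done only with pol. Because the visit is drawn at random, repeating the call with different seeds gives a distribution of rates rather than a single value.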