Mapping of positive selection sites in the HIV-1 genome in the context of RNA and protein structural constraints

Background The HIV-1 genome is subject to pressures that target the virus resulting in escape and adaptation. On the other hand, there is a requirement for sequence conservation because of functional and structural constraints. Mapping the sites of selective pressure and conservation on the viral genome generates a reference for understanding the limits to viral escape, and can serve as a template for the discovery of sites of genetic conflict with known or unknown host proteins. Results To build a thorough evolutionary, functional and structural map of the HIV-1 genome, complete subtype B sequences were obtained from the Los Alamos database. We mapped sites under positive selective pressure, amino acid conservation, protein and RNA structure, overlapping coding frames, CD8 T cell, CD4 T cell and antibody epitopes, and sites enriched in AG and AA dinucleotide motives. Globally, 33% of amino acid positions were found to be variable and 12% of the genome was under positive selection. Because interrelated constraining and diversifying forces shape the viral genome, we included the variables from both classes of pressure in a multivariate model to predict conservation or positive selection: structured RNA and α-helix domains independently predicted conservation while CD4 T cell and antibody epitopes were associated with positive selection. Conclusions The global map of the viral genome contains positive selected sites that are not in canonical CD8 T cell, CD4 T cell or antibody epitopes; thus, it identifies a class of residues that may be targeted by other host selective pressures. Overall, RNA structure represents the strongest determinant of HIV-1 conservation. These data can inform the combined analysis of host and viral genetic information.


Background
The HIV-1 genome is highly polymorphic, for several reasons. Firstly, different cross-species transmission events gave rise to different viral lineages in humans [1]. In addition, intrinsic characteristics of the virus, such as its short generation time, and lack of proofreading activity of the reverse transcriptase further increase genetic variability [2]. The virus is capable of genomic recombination, and most of its proteins tolerate coding variation [3,4]. Based on this genetic diversity, HIV-1 can be classified into several types, groups, and subtypes [5].
Theoretically, every single mutation at every position in the genome is generated every day. However, most of the resulting virions are not viable, and various layers of conservation (RNA and protein structure and use of overlapping coding frames) may effectively constrain the level of genomic variability [6,7]. On the other hand, there are recognized pressures that target the virus resulting in escape and adaptation (pressure exerted by the immune system or by antiviral treatment and bottleneck events such as transmission) [8,9]. These opposing forces need to be considered for a correct understanding of evolution of the viral genome.
Here we aim at generating a comprehensive map of the HIV-1 genome including information on conservation, positive selective pressure, and structural constraints, which has the potential to serve as a reference for understanding the limits to viral escape, and for the discovery of sites of genetic conflict between viral and host proteins.

Results
The nature of the conserved genome Although the HIV-1 genome is highly variable, 67% of amino acid positions were found to be conserved. We built a map of the genome that shows the association between conservation and structural constraints: protein and RNA structure, and presence of overlapping coding frames ( Figure 1). Seventy-two percent of residues in structured RNA regions and 74% of residues in structured protein regions (67% in β-sheets and 81% in αhelices) were conserved. The need to use overlapping reading frames is generally thought to be another level of constraint of viral diversity. However, in this dataset, only over half of the residues in these regions were conserved (56%). Univariate statistical analyses (Table 1, section A) indicated that RNA secondary structure and α-helix domains effectively limit viral variability (OR 1.29 (1.05-1.6), p = 0.02 and 1.52 (1.17-1.98), p = 0.002, respectively). In contrast, β-sheet domains and overlapping reading frames showed increased levels of variability (OR 0.74 (0.56-0.97), p = 0.03; and OR 0.55 (0.45-0.68), p = 7.3E-09, respectively).
We also performed a similar analysis for each gene separately. Secondary RNA structure played variable roles depending on the HIV-1 genes: structured RNA regions were associated with conservation for RT and gp41, but with variability in gag, while flexible RNA regions were associated with increased variability in gp120. Regarding protein structure, only α-helix domains in gag were associated with conservation; no association was found for β-sheets. Overlapping reading frames were associated with more variability in gag, tat and gp41. However, some of the gene-level subanalyses were limited by the length of the gene or the number of informative sites, and thus by the statistical power.
The nature of genomic regions under positive selective pressure Evolutionary pressures are strong driving forces for viral diversity. Twelve percent of the viral genome were found to be under positive selective pressure in the study dataset. It is believed that selective pressure predominantly acts on viral epitopes that are targets for the host immune system. We therefore mapped sites under positive selection, together with sites within known CD8 T cell, CD4 T cell, and antibody epitopes on the HIV-1 genome (Figure 1). We found that 11% of residues in CD8 T cell epitopes, 13% of residues in CD4 T cell epitopes and 18% of residues in antibody epitopes were under positive selection. Univariate analysis showed a significant association between sites under positive selection and antibody epitopes (OR 1.73 (1.28-2.32), p = 0.0005), but not CD8 or CD4 T cell epitopes (Table 1, section B).
Analyses were also performed for each gene separately. We confirmed the association of sites under positive selection and antibody epitopes for gp120. While associations with CD8 T cell and CD4 T cell epitopes were not significant in the genome-wide analysis, we did find significant associations for specific genes: associations with positive selection were observed in gp41 for CD4 T cell epitopes, and in gag and gp120 for CD8 T cell epitopes.

Combining different layers of data
Both constraining and diversifying forces shape the viral genome, and these forces are also interrelated. We therefore built multivariate models using variables from both classes of forces. The statistical results are shown in Figure 2 and in Additional file 1, Table S1.
Conservation was governed by structured RNA (OR 2.09 (1.43-3.12), p = 0.0002) and domains located in α- Figure 1 Map of the HIV-1 genome. For clarity, the genome is represented as linear, with the genes represented as a concatemer (top bar). The following layers of data are shown: Conservation: black = amino acid conservation less than 95%. RNA: dark blue = extensively structured RNA (SHAPE parameter < 0.25), light purple = flexible RNA (SHAPE > 0.5). Protein structure: blue = structured (β-sheet or α-helix), grey = no structural information available for vif, vpr, tat, rev, vpu and nef. Overlapping region: green = sites in overlapping reading frames. Positive selection: dark purple = sites under positive selection. CD8 T cell epitope: pink = CD8 T cell epitope, CD4 T cell epitope: light pink = CD4 T cell epitope, AB epitope : red = antibody epitope, AA and AG enrichment: orange = regions enriched in AA and AG dinucleotide motives.
We also performed this analysis for each gene separately (Additional file 1, Table S1). Conservation was associated with RNA structure in RT, gp120 and gp41. CD8 T cell epitopes in tat and gp120 were associated with increased variability. We also confirmed the association of variability with flexible RNA regions in gp120, with CD4 T cell epitopes in gp120 and with antibody epitopes in tat and gp120.
Similarly, we performed a multivariate analysis for the prediction of positive selection. Genome-wide, we found less positive selection in structured RNA regions (OR 0.3 (0.14-0.58), p = 0.001) and in α-helix domains (OR 0.65 (0.42-1), p = 0.05). This is consistent with the increased conservation observed in these regions. On the other hand, we found an increased number of residues under positive selection in overlapping regions (OR 7.75 (1.08-37.18), p = 0.02), in CD4 T cell epitopes (OR 1.76 (1.23-2.52), p = 0.002) and in antibody epitopes (OR 3.5 (2.28-5.37), p = 1.26E-08), also consistent with the significantly increased variability in these domains ( Figure 2B). As an additional layer of variation, we looked at sites enriched in AG and AA dinucleotide motives in a genome-wide positional screen. The three defined patches (Figure 1) enriched in AG and AA dinucleotide motives were associated with sites under positive selection (OR 2.12 (1.31-3.35), p = 0.002).
Individual gene analyses confirmed the enrichment of positively selected sites in pro and tat, in overlapping reading frames and in gp120 and nef antibody epitopes. Some genes did not follow the genome-wide trends; CD4 T cell epitopes were associated with conservation in gag, and CD8 T cell epitopes, with conservation in gag, gp41 and nef. Although CD8 T cell epitopes were not associated with positive selection genome-wide, we found opposing results for tat (having fewer residues under positive selection), and gp120 (increased number of residues under positive selection). Similar opposing results were obtained for CD4 T cell epitopes: when the gene analysis was completed, tat and rev had less positively selected residues, while gp120 and gp41 had more regions under positive selection.
The analyses could be biased if they included multiple consecutive strains from the same individual, or from transmission chains. Because of difficulties retrieving sequence information automatically, we chose to A B Figure 2 Multivariate analysis of constraining and diversifying forces shaping the viral genome. For each variable, the odds ratio and 95% confidence interval are shown for a multivariate model predicting conservation (A) or positive selection (B). The vertical line shows the null hypothesis (OR = 1). Two separate analyses were performed, one including all variables (blue) and one excluding protein structure, not available for 45% of the genome (black). OR = odds ratio, CI = confidence interval.

Distribution of signals across the genome
We assessed whether conservation and positive selection signals were equally distributed across the genome. Conservation was 80% in the 5' half of the genome versus 55% in the 3' half of the genome (OR 0.29 (0.25-0.34), p < 0.0001). Correspondingly, positive selection was observed in 7% of residues in the 5' half of the genome versus 18% in the 3' half of the genome (OR 0.32 (0.25-0.40), p < 0.0001).
All data used in this study, indexed to the HXB2 reference sequence, are available from the authors upon request.

Discussion
We have constructed an in-depth map of the HIV-1 genome that presents the landscape of genetic variation in the context of several levels of structural and immunological constraints. Over two-thirds of the viral genome and proteome are conserved. Conservation is strongly determined by RNA structure and, at the protein level, by the need to maintain α-helix domains. On the other hand, 12% of the genome are under positive selection, with an enrichment of sites observed in CD4 T cell and antibody epitopes. Previous studies advanced the understanding of protein [7] or RNA [10] structural constraints on viral genome diversity, or on immune selective pressures [11,12]. Here, we examined the viral genome under the paradigm of multiple layers of information.
Genomes of single-stranded RNA viruses contain important structures that support internal ribosome entry sites, packaging signals, pseudoknots, transfer RNA mimics, ribosomal frameshift motifs, and cis-regulatory elements. Watts et al. [10] used high-throughput SHAPE to interrogate nucleotide flexibility in the HIV-1 genome, as well as estimates of pairing probability at each nucleotide. This approach led to the identification of 10 regions that exhibit both low SHAPE reactivity and high pairing probability. Most genome regions with low SHAPE reactivity were shown to associate with a regulatory function. They proposed a model in which, in addition to the linear relationship between RNA and protein primary sequences, there is a second level of higher order RNA structure that directly modulates ribosome elongation, thus influencing native protein folding and tertiary structure. Although the present study uses data on SHAPE reactivities derived from a single viral strain, Watts et al. [10] compared the empirical data with evolutionary base-pairing probabilities predicted using an alignment of non-recombinant group M subtype sequences from the Los Alamos database, and found that only four regions were in disagreement. Overall, the present study underscores that this novel component of the genetic code represents the strongest determinant of conservation.
Our study also indicates that protein structure, specifically α-helix domains, is associated with conservation. The α-helix is the most important and stable structural element in proteins [13]. In contrast, more variation can be accommodated by β-sheets. Importantly, both layers of constraint, RNA and protein structure independently determine conservation and limit viral escape from selective pressures. These results confirm findings by Sanjuan et al., describing an association between SHAPE reactivity and second-codon position diversity (as a measure for protein sequence variation) and nonsynonymous substitution rates (as a measure for selective pressure) [14].
The non-conserved viral genome encompasses two classes of sites, variable residues under relaxing constrain, and sites that are identified as being under positive selection, indicating higher fitness in a given host environment. We investigated three canonical selective forces of adaptive immunity: CD8 T cell, CD4 T cell and antibody responses. The results identify pressures that reflect population effects; i.e., a number of hosts share both the selective factor (the host factor) and the direction of the selecting force (escape). Here, an association was established for CD4 T cell and antibody responses and positive selection at cognate epitopes. Escape from antibody responses has clearly been demonstrated to occur from the very earliest phases of HIV-1 infection [15,16]. Escape for CD4 T cell responses is far less clear cut; however, the relationship between CD4 T cell help and the maturation of the antibody response [17] may certainly contribute to the association with positive selection at cognate epitopes. However, the undisputed relevance of CTL actionwidely associated with viral escape [16,18,19], was not identified at population level polymorphism. Our interpretation is that the diversity of restricting alleles in the human population [20], the large proportion of sites in the viral genome identified as coding for CD8 T cell epitopes, and the diversity of fitness consequences of escape at the different CD8 T cell epitopes, fail to create a local signature of positive selection that can be identified in the viral genome at the population level. In addition, Irausquin and colleagues [12] indicated that many nonsynonymous mutations in both CD8 T cell and CD4 T cell epitopes are subjected to conflicting evolutionary pressures, with positive selection favoring escape mutations within hosts expressing the respective presenting HLA molecule and purifying selection acting to remove them in the population at large. Another explanation could be that escape mutants without deleterious effects become quickly fixed in the population so that these epitopes are relatively conserved [11].
Overlapping reading frames are generally thought to be evolutionarily stable and to be conserved, as mutation in one frame can negatively affect the second gene [21]. However, we identified low conservation and positive selection in those regions. An important caveat, however, is that there are currently no methods available to identify reliably site-specific positive selective pressure in these regions [22]. Thus, the approach we used may overestimate positive selection in overlapping reading frames. We also considered a possible contribution of hypermutation to structural constraints and positive selection. Only three strains were identified as being hypermutated. The analysis also explored the distribution of APOBEC3G/F editing across the genome. Three patches showed enrichment in AG and AA dinucleotide motives in a genome-wide positional screen. Although they were associated with sites under positive selection, it is not possible to establish a link of causality between deamination and genome evolution.
The joint analysis of different sets of information generated a comprehensive view of the complete genome. However, gene-specific analyses showed instances of departure from the genome-wide estimates. For example, CD8 T cell epitopes were generally well conserved, except in gp120, where they were enriched for sites under positive selective pressure. Similarly, CD4 T cell epitopes were enriched for sites under positive selection genome-wide, but these epitopes were significantly more conserved in gag. Thus, the various constraints and selective pressures do not act evenly across the genome.
Overall, the present study extends previous analyses by using a larger curated dataset of near-complete subtype B genome sequences to jointly analyze different conservation and evolutionary forces, The study by Sanjuán et al. on the interplay between RNA structure and protein evolution [14] used a hundred sequences and excluded tat and rev from the analysis. The study by Irausquin et al., on T cell epitopes [12] used between 46 and 599 sequences (depending on the gene) and did not include gp41 and gp120; the genes with the strongest signals of positive selection in the present study. The study by Woo et al., [7] analyzed the relationship between protein structure and evolutionary pressures on HIV-1 gag and env proteins by studying solvent-accessibility as a measure for protein structure, and Shannon entropy as a measure for protein diversity. They found a clear relationship between dN/dS ratio and the solventaccessibility of the residues in the protein, with surface amino acids being under positive selection, and buried amino acids under purifying selection. They found no relationship between variability (as measured by Shannon entropy) and protein structure (helix or strand). However, our results point to an association between conservation and domains in α-helices, but only in the genome-wide analysis. The recent identification of multidimensional constraints on HIV-1 Gag evolution [23] also points towards analyses that could benefit from the layers of information included in the current study with the goal of better identifying regions of immunological vulnerability.

Conclusions
The global map of the viral genome can inform models on the possible evolutionary trajectories of the virus. It also identifies positively selected sites that are not in canonical CD8 T cell, CD4 T cell or antibody epitopes, indicating a class of residues that may represent unrecognized epitopes or that are subject of other host selective pressures, such as innate immune effectors. As an example, sequence adaptation has been observed in the viral capsid in rhesus macaques upon cross-species transmission of SIVsm, due to selective pressure provided by restrictive TRIM5 alleles [24]. Recently, Alter et al. [25] identified killer immunoglobulin-like receptors (KIR)-associated amino-acid polymorphisms in the HIV-1 sequence that indicate that KIR-positive natural killer cells can place immunological pressure on the virus. In addition to informing the combined analysis of host and viral genetic information [26], the results of this study may help reveal novel mechanisms of antiretroviral response.

Methods
As of May 2011, 2723 full genome sequences were reported in the Los Alamos database http://www.hiv. lanl.gov/content/index, of which 1066 belonged to subtype B. We built a study dataset of 635 full genome subtype B sequences that were on average 9031 nucleotides long and had no large insertions upon visual inspection of the alignment in comparison to the reference