An integrated map of HIV genome-wide variation from a population perspective

Background: The HIV pandemic is characterized by extensive genetic variability, which has challenged the development of HIV drugs and vaccines. Although HIV genomes have been classified into different types, groups, subtypes and recombinants, a comprehensive study that maps HIV genome-wide diversity at the population level is still lacking. This study aims to characterize HIV genomic diversity in large-scale sequence populations, and to identify driving factors that shape HIV genome diversity.

Results: A total of 2996 full-length genomic sequences from 1705 patients infected with 16 major HIV groups, subtypes and circulating recombinant forms (CRFs) were analyzed along with structural, immunological and peptide inhibitor information. Average nucleotide diversity of HIV genomes was almost 50% between HIV-1 and HIV-2 types, 37.5% between HIV-1 groups, 14.7% between HIV-1 subtypes, 8.2% within individual HIV-1 subtypes and less than 1% within single patients. Along the HIV genome, diversity patterns and compositions of nucleotides and amino acids were highly similar across different groups, subtypes and CRFs. Current HIV-derived peptide inhibitors were predominantly derived from conserved, solvent-accessible and intrinsically ordered structures in the HIV-1 subtype B genome. We identified these conserved regions in Capsid, Nucleocapsid, Protease, Integrase, Reverse transcriptase, Vpr and the GP41 N terminus as potential drug targets. In the analysis of factors that impact HIV-1 genomic diversity, we focused on protein multimerization, immunological constraints and HIV-human protein interactions. We found that amino acid diversity in monomeric proteins was higher than in multimeric proteins, and diversified positions were preferentially located within human CD4 T cell and antibody epitopes. Moreover, intrinsically disordered regions in HIV-1 proteins coincided with high levels of amino acid diversity, facilitating a large number of interactions between HIV-1 and human proteins.

Conclusions: This first large-scale analysis provided a detailed mapping of HIV genomic diversity and highlighted drug-target regions conserved across different groups, subtypes and CRFs. Our findings suggest that, in addition to the impact of protein multimerization and immune selective pressure on HIV-1 diversity, HIV-human protein interactions are facilitated by high variability within intrinsically disordered structures.

Electronic supplementary material: The online version of this article (doi:10.1186/s12977-015-0148-6) contains supplementary material, which is available to authorized users.


Figure S4:
Geographic distribution of HIV-1 genomic diversity. Countries with no sequences available (NA) are colored white. Amino acid diversity in individual countries was mapped onto the world map from Natural Earth V2.0.0 (http://www.naturalearthdata.com/). Countries whose infections span different groups or subtypes had higher genomic diversity, with the highest diversity found in Central Africa. Our results are consistent with the reported distribution of HIV-1 subtypes described in [2], suggesting that the strains included in our study may capture the global HIV-1 diversity.

[...] and mapped GP120 peptide-derived inhibitors. On the structure, CD4 and Fab 48d structures are colored orange and pink, respectively. The GP120 and peptide inhibitor sequences are annotated beneath the protein structures. Peptide inhibitors are mapped to the GP120 functional domains (bottom), including 5 variable domains (V1-V5) and 5 conserved domains (C1-C5) [4].
The V1 to V3 and V5 loops have been identified as the minimal functional units of GP120 to mediate CXCR4-dependent infection [4]. The V3 loop is the major target for neutralizing antibodies, and V3-derived peptides offer promising anti-HIV activities [5]. GP120-derived peptides can inhibit the interactions between GP120 and T-cell surface glycoproteins (e.g. CD4, CD19), chemokine co-receptors (e.g. CCR5, CXCR4) and monoclonal antibodies [5]. The inhibitory activity of GP120-derived peptides is strain-dependent and cell-dependent (Table S4).

GP41 forms a trimeric spike with GP120 on the surface of HIV particles (Figure 4). Three C-terminal heptad repeats (CHRs) bind to three N-terminal heptad repeats (NHRs) to form a 6-helix bundle, which fuses the viral particle with the cellular membrane to create a fusion pore during viral entry [6,7]. Peptides derived from either NHR or CHR can mimic the viral structures to prevent viral entry. T20 (Enfuvirtide, Fuzeon, DP178), derived from the CHR of HIV-1 subtype B strain LAI, is a 36-amino-acid L-peptide inhibitor that targets multiple sites on GP41 and GP120 [8]. Most GP41 peptide inhibitors have been derived from the conserved pre-hairpin structure [9]. GP41-derived L-peptides have several drawbacks, such as susceptibility to proteolytic degradation, limited potency and toxicity [10]. To overcome these drawbacks, D-peptides have been proposed as promising fusion inhibitors (see reviews in [11,12]). For instance, D-peptide inhibitors PIE12 [13] and IQN17 [14] have shown promising anti-HIV activities by mimicking the GP41 heptad repeat regions.

The HIV-1 Integrase structure has an N-terminal domain (NTD), a catalytic core domain (CCD) and a C-terminal domain (CTD), connected by flexible linkers [3]. Integrase plays multiple roles during viral reverse transcription and integration [15]. The key functional role of Integrase is to insert viral dsDNA into human chromatin, creating a viral reservoir for viral infection [16].
Integrase-derived peptides can inhibit Integrase-mediated catalytic functions, Integrase inter-domain interactions and/or Integrase-human protein interactions (Table S4). Peptide inhibitors derived from the CCD can inhibit Integrase dimerization, 3'-end DNA processing and strand transfer during viral integration [17].

Reverse transcriptase (RT) forms a heterodimer to synthesize HIV dsDNA from the viral genomic RNA [18]. RT structures comprise the finger, palm, thumb, connection, RNaseH and p51 functional domains [19]. Peptide inhibitors derived from the connection domain (Pep-7 [20], Peptide1 [21]) and the thumb domain (P24 [22], P27 [22], PAW [22]) can block the dimerization of p66 and p51. Nanoparticle systems can improve the delivery of RT peptide inhibitors [22].

Beta-sheets of the N-terminal and C-terminal domains are crucial for protease dimerization [23,24]. HIV-derived peptide inhibitors that mimic the N- and C-terminal domains have been investigated as potential protease inhibitors. These include the cross-linked interfacial peptide PF1 [25] and the PR-derived peptide p-S8 [26]. Peptides derived from protease positions 83-93 can also inhibit protease folding [27-29].

The regulatory protein Tat can bind to GP120 to enhance viral entry [30]. Peptide sequences derived from Tat positions 48-57 can interrupt the Tat-GP120 interaction in a concentration-dependent manner [30]. The peptide inhibitor Tat11 can interrupt nuclear import by interacting with the host importin beta protein [31]. Moreover, Tat-mediated transcription can be inhibited by interrupting Tat-TAR interactions, which involve the arginine-rich motif of Tat and the 3-nt bulge of the TAR RNA hairpin (U23, A27, U38) [32].

No interaction between Vpr and RT, or between Vpr and Integrase, has been reported. However, peptide inhibitors derived from Vpr domains (positions: 57-71, 61-75) can interfere with the activity of both RT and Integrase [33].
Two studies have shown that Vpr-derived peptides (positions: 55-69, 60-74) can inhibit the strand transfer and the 3'-end processing reactions performed by Integrase [34,35].

Rev can target the Rev response element (RRE) in the viral RNA genome during nuclear export, while Rev-derived peptides can interrupt the Rev-RRE interaction [36]. Rev can physically bind to Integrase to form a pre-integration complex so that viral integration can be postponed until the completion of nucleocytoplasmic shuttling [37,38]. Two Rev-derived peptides (positions: 1-30, 49-74) can inhibit Integrase 3'-end processing and strand transfer in cell-free assays [37]. Moreover, direct interactions between two Rev domains (positions: 12-23, 53-67) and Integrase domains (positions: 118-128, 66-80) have been reported [39]. Two shorter Rev peptides (positions: 13-23, 53-67) have also shown inhibitory activity [40]. The Integrase-derived peptides INr-1 and INr-2 can stimulate viral genome integration and interrupt the Rev-Integrase protein interaction [41].

Capsid pentamers and hexamers constitute the internal shell of viral particles [42]. The alpha-helical structure of the C-terminal domain (CTD, positions: 146-231) participates in capsid multimerization [43]. Peptide inhibitors derived from the CTD can interrupt the multimerization of HIV-1 Capsid by mimicking the capsid multimerization interfaces. The peptide inhibitor CAC1, derived from the CTD, can dissociate CTD dimers (Kd = 50 µM) [44]. Since peptide inhibitors must penetrate the viral membrane to prevent Capsid multimerization, cell-penetrating peptides have been designed to improve peptide potency in cell culture experiments [45,46]. For instance, the peptide inhibitor CAI [47] has been converted into a cell-penetrating peptide, NYAD-1, which improves the binding affinity and inhibits the post-entry stage [45]. The cell-penetrating peptides NYAD-201 and NYAD-202 have shown promising anti-HIV activities [46].
The Vif-derived peptide Vif41-65 (positions: 41-65) can inhibit protease activity [48]. Vif positions 36, 47, 101, 117 and 124 are associated with PI treatment [49]. Two Vif-derived peptides (positions: 30-65, 78-98) have also been shown to inhibit protease activity [50]. The N terminus of Protease (positions: 1-9) interacts with the central domain of Vif (positions: 78-98) [51]. Two Vif-derived peptides (positions: 81-88, 88-98) can inhibit protease activity [50,52].

Figure S22:
Distribution plots of amino acid diversity between the consensus and the circulating genomes (blue), and within circulating genomes (black). X-axis indicates HIV-1 subtypes B, A1, C, D, CRF01_AE and CRF02_AG, each of which contains more than 50 sequences in our datasets. Y-axis indicates the amino acid genomic diversity. For each subtype, the consensus sequence is obtained by retaining the most prevalent amino acid at each position. For the 6 major HIV-1 subtypes and CRFs (A1, B, C, D, 01_AE, 02_AG), the average amino acid diversity between circulating strains (12.3 ± 1.5%) was significantly higher than that between the consensus and the circulating strains (8.3 ± 1.3%, P-value < 0.001).
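
The two quantities compared in this caption can be illustrated with a minimal Python sketch. It assumes pre-aligned sequences of equal length and uses a simple p-distance (fraction of differing positions) as the diversity measure; the function names, the toy sequences and the choice of p-distance are illustrative assumptions, not the procedure used in the study.

```python
from collections import Counter
from itertools import combinations


def consensus(seqs):
    """Keep the most prevalent residue in each aligned column.

    Ties go to the residue seen first in the column (Counter ordering).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_diversity_to_consensus(seqs):
    """Average distance between the consensus and each circulating sequence."""
    cons = consensus(seqs)
    return sum(p_distance(cons, s) for s in seqs) / len(seqs)


def mean_pairwise_diversity(seqs):
    """Average distance over all pairs of circulating sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment (not real HIV-1 data).
toy = ["MGARA", "MGSRA", "MGARV", "KGARA"]
print(consensus(toy))
print(mean_diversity_to_consensus(toy))
print(mean_pairwise_diversity(toy))
```

In this toy alignment the pairwise diversity is larger than the diversity to the consensus, which is the direction of the difference reported in the caption: two strains can each differ from the consensus at different positions, so their mutual distance accumulates both sets of differences.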

Figure S23:
Similarity of prediction results between the consensus and the 9 protein secondary structure prediction methods. Consensus predictions were obtained using a majority voting strategy among the 9 individual methods. For the 15 HIV-1 proteins in the full-length genome of HIV-1 subtype B, the similarity between two methods was calculated as the percentage of common predictions of alpha-helix (top-right part of the matrix) and beta-strand (bottom-left part of the matrix) structures.
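
The majority voting and the per-class similarity described in this caption can be sketched as follows. This is a minimal Python illustration under stated assumptions: predictions are strings over H (alpha-helix), E (beta-strand) and C (other); ties are resolved in favor of the label that appears first among the methods; and the similarity denominator is the set of positions that either method assigns to the class. The caption does not specify the tie rule or the denominator, so both are assumptions, as are the function names and toy strings.

```python
from collections import Counter


def majority_vote(predictions):
    """Per-position majority label across methods.

    Ties are resolved in favor of the label encountered first,
    i.e. the label of the earliest-listed method among the tied ones.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*predictions))


def class_similarity(a, b, label):
    """Percentage of positions given `label` by both methods,
    out of the positions given `label` by at least one of them."""
    both = sum(x == label and y == label for x, y in zip(a, b))
    either = sum(x == label or y == label for x, y in zip(a, b))
    return 100.0 * both / either if either else 0.0


# Hypothetical predictions from three methods for a 10-residue protein.
m1 = "HHHHCCEEEC"
m2 = "HHHCCCEEEE"
m3 = "CHHHCCCEEC"

cons = majority_vote([m1, m2, m3])
print(cons)
print(class_similarity(m1, m2, "H"))  # helix similarity (top-right of the matrix)
print(class_similarity(m1, m2, "E"))  # strand similarity (bottom-left of the matrix)
```

With an odd number of methods and three classes, ties such as 4-4-1 are still possible, so any real implementation would need an explicit tie rule.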

Figure S24:
Prediction similarities between the consensus and 17 methods for protein intrinsic disorder prediction. Prediction similarities were calculated as the percentages of common predictions of ordered (disorder tendency score < 0.5) or disordered (disorder tendency score ≥ 0.5) positions in the HIV-1 protein structures. Consensus predictions were obtained using a majority voting strategy among the 17 individual methods. The consensus method has the highest average prediction similarity to the other methods.
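
The thresholding, voting and similarity steps in this caption can be sketched in Python as below. The 0.5 cutoff comes from the caption; the method names, the toy scores and the reading of "average prediction similarity" as the mean percentage agreement with every other method are assumptions made for illustration only.

```python
def binarize(scores, threshold=0.5):
    """1 = disordered (score >= threshold), 0 = ordered (score < threshold)."""
    return [int(s >= threshold) for s in scores]


def consensus_calls(calls):
    """A position is disordered when more than half of the methods say so."""
    n = len(calls)
    return [int(2 * sum(col) > n) for col in zip(*calls)]


def agreement(a, b):
    """Percentage of positions on which two methods make the same call."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def mean_similarity_to_others(calls):
    """For each method, the mean agreement with every other method."""
    return {
        name: sum(agreement(vec, other)
                  for o, other in calls.items() if o != name) / (len(calls) - 1)
        for name, vec in calls.items()
    }


# Hypothetical disorder tendency scores from three methods (not real output).
scores = {
    "method_a": [0.3, 0.8, 0.1, 0.2, 0.9, 0.4],
    "method_b": [0.7, 0.2, 0.3, 0.1, 0.6, 0.2],
    "method_c": [0.9, 0.6, 0.7, 0.4, 0.8, 0.1],
}
calls = {name: binarize(s) for name, s in scores.items()}
calls["consensus"] = consensus_calls(list(calls.values()))

mean_sim = mean_similarity_to_others(calls)
print(calls["consensus"])
print(max(mean_sim, key=mean_sim.get))
```

In this toy case each method disagrees with the majority at a different position, so the consensus agrees with every method more often than the methods agree with each other. An odd number of methods, such as the 17 used here, avoids tied votes.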