APOBEC3G and APOBEC3F rarely co-mutate the same HIV genome
© Ebrahimi et al.; licensee BioMed Central Ltd. 2012
Received: 14 August 2012
Accepted: 11 December 2012
Published: 20 December 2012
Skip to main content
© Ebrahimi et al.; licensee BioMed Central Ltd. 2012
Received: 14 August 2012
Accepted: 11 December 2012
Published: 20 December 2012
The human immune proteins APOBEC3G and APOBEC3F (hA3G and hA3F) induce destructive G-to-A changes in the HIV genome, referred to as ‘hypermutation’. These two proteins co-express in human cells, co-localize to mRNA processing bodies and might co-package into HIV virions. Therefore they are expected to also co-mutate the HIV genome. Here we investigate the mutational footprints of hA3G and hA3F in a large population of full genome HIV-1 sequences from naturally infected patients to uniquely identify sequences hypermutated by either or both of these proteins. We develop a method of identification based on the representation of hA3G and hA3F target and product motifs that does not require an alignment to a parental/consensus sequence.
Out of nearly 100 hypermutated HIV-1 sequences only one sequence from the HIV-1 outlier group showed clear signatures of co-mutation by both proteins. The remaining sequences were affected by either hA3G or hA3F.
Using a novel method of identification of HIV sequences hypermutated by the hA3G and hA3F enzymes, we report a very low rate of co-mutation of full-length HIV sequences, and discuss the potential mechanisms underlying this.
Human apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like 3G and 3F (hA3G,F) are enzymes important in the control of viral growth [1, 2]. They are known for their ability to block HIV infection in the absence of the virion infectivity factor (Vif) . The mechanisms of inhibition proposed for these proteins can be classified as either cytosine deamination-dependent  or deamination-independent . In the former, one or more hA3G and/or hA3F molecules are trafficked into a nascent virion and are released along with the viral RNA into the cytoplasm of a newly infected cell. The HIV RNA is reverse transcribed in the host cell to create a DNA minus strand followed by degradation of the original RNA and its replacement by a DNA plus strand. During the degradation of the viral RNA, regions of the DNA minus strand remain transiently unpaired. These single stranded DNA regions are targeted by hA3G and/or hA3F . These enzymes mutate cytosine to uracil within specific sequence motifs. For example hA3G and hA3F preferentially target C within the context of CC and TC, respectively (the underlined cytosine is deaminated to uracil). The uracils in the minus strand pair with adenosines in the plus strand. Therefore the action of hA3G, F results in the replacement of G by A in the plus strand and subsequently in the HIV RNA genome. The G-to-A mutation hotspots are GG and GA for hA3G and hA3F, respectively [7, 8]. Usually the HIV sequences targeted by these enzymes show G-to-A mutations in multiple positions; therefore are referred to as hypermutated sequences .
It is known that hA3G and hA3F are widely co-expressed in the human cells [10, 11] and co-localize in the mRNA processing (P) bodies . They might also co-package into nascent HIV virions, thus acting cooperatively to inhibit HIV infection [10, 11]. Despite this, most of the studies so far have concentrated on the impact of individual APOBEC3 proteins. Therefore, the collective impact of these proteins on the HIV genome is not clear. The limited studies of the effect of both proteins have returned contradicting results. Analysis of hypermutated sequences from the HIV-1 pol , vpu/env , gag  and also from the near complete viral genomes  points to the substantial domination of either hA3G or hA3F. However, lightly mutated sequences with footprints of both proteins have also been reported . In addition, analysis of fractional env/nef and in a different study, vif, gag and env  have shown sequences with mutations within both GG and GA motifs. However, given short sequence reads and the limitations of current identification methods, it can often be hard to identify the target motif preferences of hypermutation. The aim of this study is to find out what proportion of the in vivo full genome HIV sequences are targeted by both proteins. To achieve this aim the first step is to accurately identify sequences that contain signatures of mutation by hA3G and/or hA3F [7, 19]. To investigate the joint impact of hA3G and hA3F, we developed a method that identifies sequences hypermutated by hA3G, hA3F or both by taking advantage of the unique context-dependency of G-to-A mutation by hA3G and hA3F. The target motifs GG (for hA3G) and GA (for hA3F) are less represented in the genomes of hypermutated sequences compared to those of normal HIV sequences. By contrast, the product motifs (AG and AA) are more represented in the hypermutated sequences compared to normal sequences. Therefore, a measure of ‘motif representation’  can be used to identify affected sequences. It is worth noting that all hA3 proteins, except for hA3G, have a dinucleotide target preference similar to that of hA3F . Therefore, any signature attributed to hA3F in this paper could well be the footprint of other hA3 proteins with a GA-to-AA mutation preference. The reason for the use of hA3F here is its greater mutagenic activity against HIV when compared to the other hA3 members with the same motif preference.
Here, we calculate the representation of hA3G and hA3F target and product motifs in 2829 HIV-1 sequences as well as in the 88 full genome HIV-1 sequences classified as “hypermutated” in the Los Alamos National Laboratory (LANL) database. These 2917 sequences are from all groups (M, O, N and P), subtypes (A, B, C, D, F, G, H, J and K) and recombinant forms (e.g. 01_AE, A1D, A2C, A1DK, A1A2D, 21_A2D, 02_AG, 04_cpx, 05_DF, BF and the like) that have been reported in the database. The results showed only one sequence, out of approximately one hundred hypermutated sequences, with clear signatures of mutation by both hA3G and hA3F. Interestingly this sequence does not belong to any of the subtypes and recombinant forms of the HIV main (M) group and has been classified as an outlier (O) HIV-1 sequence. Analysis of in vitro hypermutated sequences as well as simulated hypermutation data further confirmed the efficiency of the method based on motif representation in identification of co-mutated sequences.
A list of seven HIV-1 sequences with α >> 99.9% and two HIV-1 sequences with α < 50%
N or LM
N or LM
Among those sequences tagged as hypermutated in LANL database, we identified two sequences (shown by filled triangles under the 95% confidence interval line in Figure 1b) that are different from the population of normal HIV-1 sequences at confidence levels less than 50%. This may imply that these two sequences are normal HIV-1 sequences which have been misclassified as hypermutated in the database. Another explanation could be that these are lightly mutated sequences for which it is not possible to discriminate random G-to-A mutation (by reverse transcriptase)  from context dependent G-to-A mutation (by hA3G,F) within a large population of HIV sequences. Therefore, they appear within the range of normal HIV-1 sequences. Table 1 shows the accession numbers of the sequences for which we estimate a hypermutation status different from those of the LANL database.
The following points can be inferred from this figure:
1- Type of hypermutation
The hypermutated sequences are classified into three categories shown by filled arrows. The first category contains sequences with a high DRhA3G which appear along the horizontal axis. These sequences show signs of mutation by hA3G. The second category contains sequences with a high DRhA3F which stretch along the vertical axis. These sequences have footprints of mutation by hA3F. The third category includes sequences which are high in both DRhA3G and DRhA3F. These sequences which have footprints of co-mutation by hA3G and hA3F would appear in the top right quarter of the graph. The sequence shown by number “1” is from this category.
In Figure 2 seven nominally normal sequences with α >>99.9% are shown with filled circles. Two of these sequences extend in the direction of mutation by hA3G and four in the direction of the hA3F axis, implying that they are hypermutated sequences. There is one sequence (shown by a small arrow) that appears outside the 99.9% interval but does not extend in either direction.
2- Extent of hypermutation
Within the reported sequences in the database, the number of sequences targeted by hA3G is much larger than those affected by hA3F. The extent of hypermutation by hA3G is also greater than that of hA3F as evidenced by the larger scale of the DRhA3G axis compared to that of DRhA3F. The calculated values of DRhA3G and DRhA3F can be used as a measure of the extent of mutation. In general the larger these values the more mutation the sequence has undergone.
3- Mutation by both hA3G and hA3F
The sequences which are mutated by both hA3G and hA3F would be located away from the normal sequences and between the sequences targeted by either hA3G or hA3F. The contribution of each protein in hypermutation can be determined from the location of the targeted sequence in the space of DRhA3G versus DRhA3F. Surprisingly only one sequence, shown by “1” and belonging to the outlier HIV group, exhibits a clear footprint of both proteins. However it is located closer to the line of hypermutated sequences by hA3G indicating a greater contribution from hA3G. This sequence (accession number AF407419) was isolated from an HIV patient in 1992 and contains multiple stop codons in all open reading frames. There are also two other sequences (shown by numbers 2 and 3) with slightly elevated DRhA3F values compared to those of the remaining sequences mutated by hA3G.
Within the HIV-1 sequences, there is no extensively co-hypermutated sequence with an equal or greater contribution from hA3F. A list of hypermutated sequences at >99.9% confidence level, their accession numbers and source(s) of mutation is given in the Additional file 1: Table S1.
Our analysis suggests that motif representation is a useful method for classifying HIV sequences as hypermutated. However, it is also possible that the variations in motif representation we observe arose by random chance, or non-specific mutation. Since these processes are expected to affect the positive and negative HIV strands equally, whereas hypermutation should show its signature only on the positive strand, we therefore analysed the motif representation of the negative strands. Consistent with a strand-specific mechanism of mutation, the motif ratios of the negative strands clustered much more evenly than the positive strands, with no clear evidence for context-dependent mutation. We note also that issues such as codon usage or conserved regulatory elements that may bias motifs should be equally present in negative strands [using the complement of the motif]. This confirms the utility of our method and that it is not excessively biased by these factors. The results are given in Figure S2 of the Additional file 1.
The accession numbers, DRs, T 2 values and hypermutation status of the HIV-1 sequences hypermutated in vitro by hA3G (sequences starting with 3G) or hA3F (sequences starting with 3F)
α > 99.9%
Since neither the analysis of patient sequences from the database nor in vitro hypermutated sequences identified co-mutated genomes, this raises the question of whether our method can detect co-mutated sequences if they occur. In the absence of experimental sequences that appear co-mutated, we simulated co-mutation as described below.
The motif dependent G-to-A changes by hA3G and/or hA3F results in an increase in DRhA3G and/or DRhA3F, However, random G-to-A changes by reverse transcriptase do not affect these two diagnostic ratios. As a result the HIV population generated by a random (context-independent) G-to-A mutation simulation does not differ from the original HIV population (see panels A and E of Figure 4). The results of the analysis of the simulation data imply that co-mutation by hA3G and hA3F is detectable if present and that its absence from the in vivo sequences suggests it is a rare event.
The proteins hA3G and hA3F are important enzymes in controlling viral replication. One of their mechanisms of action is the introduction of multiple mutations into the genome of the affected viruses. The genes for these two proteins are located in close proximity on chromosome 22. They are widely co-expressed in different tissues and also co-packaged into the HIV virions. This may indicate that hA3G and hA3F act cooperatively against HIV. There are many reports on the mutagenicity of each of these individual proteins, but their co-operative action is less understood . This may be partly due to the analysis of short sequences (typically < 10% of the complete HIV genome) [11, 13–15, 17, 23] within which it is not trivial to accurately identify mono- versus co-mutated signatures. Also the previous method of identification has usually been based on alignment and comparison of suspected hypermutated sequences to a consensus sequence which may or may not be the true ancestral sequence of hypermutated sequences. Here, we studied the cooperative impact of hA3G and hA3F by investigating the mutation footprint they have left on the full genome hypermutated HIV sequences. We developed a novel bioinformatics approach for accurate identification of sequences hypermutated by either or both of these two enzymes. Identification of hypermutated sequences is important as they contain information about the mutation mechanism by hA3G and hA3F. Our method identifies hypermutated HIV-1 sequences using the within sequence signature(s) of G-to-A changes by hA3G and/or hA3F. It also uniquely discriminates among G-to-A mutation by reverse transcriptase, hA3G, hA3F and both of these latter proteins. It is based on the representation of hA3G and hA3F target and product motifs in the genome of thousands of HIV sequences. The representation of each motif is quantified in such a way that it becomes independent of the frequency of its constituent mononucleotides. We have coded the algorithm into an executable Java program ‘hypersign’ that is available as an Additional file 2. Analysis of a large population of full genome HIV-1 sequences revealed that co-mutation of the same genome by hA3G and hA3F is a rare event. We performed simulations to confirm that co-mutation is detectable by our method, confirming its absence is not an artifact of the analytic approach.
Two possible mechanisms can be postulated to explain the rarity of co-mutated sequences. Firstly, a low proportion of total sequences showing any hypermutation was observed, suggesting very few virions incorporate APOBEC3 proteins . This could either arise because most virions have no co-packaged APOBEC3, but a few virions package one and some co-package two copies of the enzymes. In this scenario, it is easy to imagine that the number of copies of APOBEC3 should be Poisson distributed. So, assuming random sampling, if 12 out of 2917 genomes (see Figure 2) have at least one hA3F protein and 79 out of 2917 have at least one hA3G, then we would only expect 1×10-4 virions having at least one copy of each enzyme. Thus, given the small number of total sequences showing hypermutation, the low frequency of co-mutated sequences maybe explained simply by the low chance of incorporation of either enzyme individually.
Alternatively, hypermutation may arise because Vif fails to exert its effects in a small proportion of cells, leading to co-packaging of many copies of APOBEC3 into the genomes produced from these cells. In this case, we might expect that many genomes that contain either hA3G or hA3F would in fact have copies of both enzymes. In this case, then the lack of co-mutation may suggest either that these two proteins compete for the same HIV RNA, or that co-packaged hA3G and hA3F interfere with the enzymatic activity of one another, perhaps by forming hetero-oligomers [10–12]. An alternative possibility is that, in the presence of co-packaged hA3G and hA3F, one enzyme acting before the other may somehow disrupt the target motifs or inhibit our ability to recognize co-mutation. However, we have simulated these enzymes acting in different orders, and find this does not significantly affect the ability to recognize co-mutated sequences (data not shown); therefore this seems to be an unlikely explanation.
Our analysis clearly illustrates that co-mutation by hA3G and hA3F is a rare occurrence. However, we are unable to draw conclusions as to the biological mechanisms behind this. Further sequencing of full-length HIV genomes derived from in vivo infection might help identify if the frequency of co-mutation is consistent with a Poisson distribution of hA3G and hA3F singly mutated genomes. However, our estimates are that this would require a large number of sequences to accurately estimate this frequency. Alternatively, experimental approaches to determine the accurate stoichiometry of hA3 molecules per virion in vivo might also identify whether a low frequency of incorporation and low copy number of incorporated enzymes is present.
In order to distinguish hypermutated from normal sequences a measure of the representation of hA3G and hA3F target and product motifs is required. The “representation” of a motif needs to be a quantity that signifies the difference between the observed and expected probabilities of the motif. Simply using the observed probability (relative frequency) of a motif to infer under- or over-representation is inappropriate. This is because the observed probability of a motif is not an independent entity and is influenced by the relative frequencies of its sub-motifs. For example the observed probability of the dinucleotide AA might be very high in a sequence, simply because the sequence has a high proportion of the mononucleotide A. Therefore AA, despite being frequent, is not over-represented. Thus the observed probabilities of sub-motifs need to be considered when estimating the expected probability of a motif.
The dinucleotide GG is changed to AG by hA3G; therefore, those HIV sequences that have been targeted by hA3G are expected to have a higher DRhA3G compared to normal sequences. By the same token, mutation by hA3F results in an increase in DRhA3F. Sequences that have been affected by both proteins hA3G and hA3F show an increase in both DRs. Importantly we do not measure simply the frequency ratio of AG/GG (for example in the case of hA3G). Rather, we find the ratio of the observed relative frequency of AG to the expected relative frequency of AG (based on the underlying relative frequencies of A and G in the sequence) divided by the ratio of the observed relative frequency of GG to the expected relative frequency, thus accounting for variations in base counts between sequences.
We note that this analysis of dinucleotide motifs can be extended to incorporate longer motifs, and indeed in our previous work we have studied the motif representation of dinucleotides, trinucleotides, and tetranucleotides . However, as the motif preference of hA3F is not found to extend beyond dinucleotide in the full genome in vivo HIV-1 sequences (see Figure S1 of the Additional file 1), we utilise only the dinucleotide motif for both enzymes. In addition, although factors such as codon bias and conserved regulatory motifs might affect the absolute value of the representation of motifs at a population level, they do not affect the proposed method that is based on the ‘difference’ in the representation of motifs from normal and hypermutated sequences. That is, we do not require normal sequences to have a ratio of exactly one, but rather empirically determine the ‘normal’ range for the ratio, which includes these factors.
In this work, x i is a vector of length two containing DRhA3G and DRhA3F of sequence i, x is a vector of length two containing the two averages of 2829 DRhA3G and DRhA3F from the normal HIV-1 sequences. S is the variance-covariance matrix.
where I is the number of HIV sequences (here 2829), J is the number of variables (here two, DRhA3G and DRhA3F), F is the Fisher’s F statistic at the confidence level α and degrees of freedom J and I J.
To test the accuracy of the group membership prediction by our proposed method, we performed the same analysis on 19 HIV-1 sequences mutated in vitro by hA3G or hA3F .
DE is funded by a Training Fellowship from the Australian National Health and Medical Research Council (NHMRC). FA is supported by a postgraduate scholarship from the Australian Government and the University of New South Wales. MPD is a senior Research Fellow funded by the NHMRC.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.