To distinguish hypermutated from normal sequences, a measure of the representation of hA3G and hA3F target and product motifs is required. The “representation” of a motif needs to be a quantity that signifies the difference between the observed and expected probabilities of the motif. Simply using the observed probability (relative frequency) of a motif to infer under- or over-representation is inappropriate, because the observed probability of a motif is not an independent quantity: it is influenced by the relative frequencies of its sub-motifs. For example, the observed probability of the dinucleotide AA might be very high in a sequence simply because the sequence has a high proportion of the mononucleotide A. In that case AA, despite being frequent, is not over-represented. Thus the observed probabilities of sub-motifs need to be considered when estimating the expected probability of a motif.

Representation can be defined as a ratio of observed over expected probabilities. The observed probability (p_{obs}) of a motif is the total count of the motif (e.g. AG) in the sequence divided by the total count of all possible motifs of the same length (AA, AC, AG, …, TT). The expected probability (p_{exp}) of a motif can be calculated using the observed probabilities of its sub-motifs [20]. For example, the expected probability of the dinucleotide AG is the product of the observed probabilities of the mononucleotides A and G. Eq. 1 shows the representation (D) of dinucleotide AG as a typical example.
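The displayed equation is not reproduced in this text. Written out from the definitions in the paragraph above, the representation of AG would take the following form; the subscript on D and the function-style notation p_{obs}(·) are notational assumptions made here, not necessarily the authors' typography.

```latex
D_{AG} \;=\; \frac{p_{obs}(AG)}{p_{exp}(AG)}
       \;=\; \frac{p_{obs}(AG)}{p_{obs}(A)\, p_{obs}(G)}
```

Under this form, D greater than one indicates over-representation and D less than one indicates under-representation relative to the base composition of the sequence.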

The representations of hA3G and/or hA3F target motifs (GG and GA, respectively) decrease and those of product motifs (AG and AA, respectively) increase in hypermutated sequences compared to normal HIV sequences. We define two ratios of product over target representations, one for hA3G (Eq. 2) and one for hA3F (Eq. 3). As will be described later, these diagnostic ratios (DRs hereafter) are used together in a bivariate distribution to identify hypermutated sequences.
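The two displayed equations are not reproduced in this text. Following the verbal definition (product representation over target representation), they would read as below. This is a sketch that assumes the expected probability of each dinucleotide is the product of the observed mononucleotide probabilities; the simplified right-hand sides are plain algebra under that assumption.

```latex
DR_{hA3G} \;=\; \frac{D_{AG}}{D_{GG}}
          \;=\; \frac{p_{obs}(AG)\, p_{obs}(G)}{p_{obs}(GG)\, p_{obs}(A)}
\qquad
DR_{hA3F} \;=\; \frac{D_{AA}}{D_{GA}}
          \;=\; \frac{p_{obs}(AA)\, p_{obs}(G)}{p_{obs}(GA)\, p_{obs}(A)}
```

The simplified forms make the correction explicit: each raw product/target frequency ratio is rescaled by the G-to-A ratio of the sequence itself.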

The dinucleotide GG is changed to AG by hA3G; therefore, HIV sequences that have been targeted by hA3G are expected to have a higher DR_{hA3G} than normal sequences. By the same token, mutation by hA3F results in an increase in DR_{hA3F}. Sequences that have been affected by both hA3G and hA3F show an increase in both DRs. Importantly, we do not simply measure the frequency ratio AG/GG (in the case of hA3G, for example). Rather, we take the ratio of the observed relative frequency of AG to the expected relative frequency of AG (based on the underlying relative frequencies of A and G in the sequence) and divide it by the ratio of the observed relative frequency of GG to the expected relative frequency of GG, thus accounting for variations in base counts between sequences.
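The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, dinucleotides are counted with overlapping windows (an assumption), and ambiguity codes or sequences lacking a target motif (which would cause a division by zero) are not handled. The two toy sequences are invented for the example.

```python
from collections import Counter


def observed_probabilities(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {motif: c / total for motif, c in counts.items()}


def representation(seq, motif):
    """Observed probability of motif divided by its expected probability.

    The expected probability is taken as the product of the observed
    probabilities of the motif's mononucleotides.
    """
    p_mono = observed_probabilities(seq, 1)
    p_motif = observed_probabilities(seq, len(motif))
    expected = 1.0
    for base in motif:
        expected *= p_mono[base]
    return p_motif.get(motif, 0.0) / expected


def diagnostic_ratios(seq):
    """Return (DR_hA3G, DR_hA3F): product over target representation."""
    dr_g = representation(seq, "AG") / representation(seq, "GG")
    dr_f = representation(seq, "AA") / representation(seq, "GA")
    return dr_g, dr_f


if __name__ == "__main__":
    # Toy sequence (hypothetical) and a copy in which the first two
    # GG dinucleotides have been changed to AG, mimicking hA3G editing.
    normal = "AGGTGGCAAGGAGACGGTGA"
    edited = normal.replace("GG", "AG", 2)
    print(diagnostic_ratios(normal))
    print(diagnostic_ratios(edited))
```

In this toy case DR_{hA3G} rises from about 1.25 to about 2.0 after the simulated GG-to-AG changes, which is the direction of shift the method relies on.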

We note that this analysis of dinucleotide motifs can be extended to incorporate longer motifs, and indeed in our previous work we have studied the motif representation of dinucleotides, trinucleotides, and tetranucleotides [20]. However, as the motif preference of hA3F is not found to extend beyond dinucleotide in the full genome *in vivo* HIV-1 sequences (see Figure S1 of Additional file 1), we utilise only the dinucleotide motif for both enzymes. In addition, although factors such as codon bias and conserved regulatory motifs might affect the absolute value of the representation of motifs at a population level, they do not affect the proposed method, which is based on the ‘difference’ in the representation of motifs between normal and hypermutated sequences. That is, we do not require normal sequences to have a ratio of exactly one, but rather empirically determine the ‘normal’ range for the ratio, which includes these factors.

We downloaded 2829 full genome (> 7000 nt) HIV-1 sequences from the LANL database, as well as 88 sequences identified as “hypermutated” by LANL, in June 2011. For each sequence we calculated DR_{hA3G} and DR_{hA3F} and then the Hotelling’s *T*^{2} statistic. The Hotelling’s *T*^{2} statistic (Eq. 4) is an extension of the Student *t* statistic to multivariate distributions. It is used to determine group membership in data with more than one measured variable [25].

In this work, *x*_{i} is a vector of length two containing DR_{hA3G} and DR_{hA3F} of sequence *i*, and *x̄* is a vector of length two containing the averages of DR_{hA3G} and DR_{hA3F} over the 2829 normal HIV-1 sequences. *S* is the variance-covariance matrix.
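The displayed equation for the statistic is not reproduced in this text. With the symbols defined here, the standard form of the Hotelling's *T*^{2} statistic for a single observation would be as follows; this is the textbook expression and is assumed, not confirmed, to match the authors' Eq. 4.

```latex
T_i^{2} \;=\; (x_i - \bar{x})^{\mathsf{T}}\, S^{-1}\, (x_i - \bar{x}),
\qquad
x_i = \begin{pmatrix} DR_{hA3G,\,i} \\ DR_{hA3F,\,i} \end{pmatrix}
```

Here \bar{x} is the mean vector of the normal sequences and S^{-1} is the inverse of the 2 × 2 variance-covariance matrix.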

The Hotelling’s *T*^{2} statistic of a given HIV-1 sequence is the square of the Mahalanobis distance of the sequence from the centre of the population of normal HIV-1 sequences in a two-dimensional space specified by DR_{hA3G} and DR_{hA3F}. The larger this distance, the less likely the sequence is normal, and therefore the more likely it is hypermutated. For each sequence, the likelihood of its membership in the normal HIV-1 population is quantified using the probability associated with its *T*^{2} statistic [25]. The confidence level (α) of the Hotelling’s *T*^{2} statistic is given by Eq. 5, where *I* is the number of HIV sequences (here 2829), *J* is the number of variables (here two, DR_{hA3G} and DR_{hA3F}), and *F* is Fisher’s *F* statistic at the confidence level α with degrees of freedom *J* and *I* − *J*.
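As an illustration of how a *T*^{2} value and its associated probability might be computed for two variables, the sketch below uses only the Python standard library. Several points are assumptions rather than details taken from the source: the function names are hypothetical, the variance-covariance matrix is the sample estimate with denominator *I* − 1, and the scaling *T*^{2} ~ *J*(*I* − 1)/(*I* − *J*) · *F*(*J*, *I* − *J*) is one common textbook form that may differ from the authors' Eq. 5 (for example by an extra factor when the tested sequence is not part of the reference set). The closed-form tail probability is valid only for *J* = 2.

```python
from statistics import mean


def hotelling_t2(point, reference):
    """Squared Mahalanobis distance of point from the centre of reference.

    point is a (DR_hA3G, DR_hA3F) pair; reference is a list of such pairs
    from the normal population.
    """
    n = len(reference)
    mx = mean(p[0] for p in reference)
    my = mean(p[1] for p in reference)
    # Sample variance-covariance matrix S (denominator n - 1; an assumption).
    sxx = sum((p[0] - mx) ** 2 for p in reference) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in reference) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in reference) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # (x - mean)^T S^{-1} (x - mean), with the 2x2 inverse written out.
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def p_value_two_variables(t2, n):
    """Upper-tail probability of t2 for J = 2 variables and n sequences.

    Assumes T^2 ~ J(I-1)/(I-J) * F(J, I-J). For numerator degrees of
    freedom 2, the F upper-tail probability has the closed form
    (1 + 2f/d2) ** (-d2/2).
    """
    j = 2
    d2 = n - j
    f = t2 * d2 / (j * (n - 1))
    return (1.0 + 2.0 * f / d2) ** (-d2 / 2.0)


if __name__ == "__main__":
    # Hypothetical toy reference set of (DR_hA3G, DR_hA3F) pairs.
    reference = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
    t2 = hotelling_t2((3.0, 1.0), reference)
    print(t2, p_value_two_variables(t2, len(reference)))
```

A small probability means the sequence lies far from the centre of the normal population in the DR plane; the cut-off used to call a sequence hypermutated is a separate choice not made here.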

To test the accuracy of the group membership prediction by our proposed method, we performed the same analysis on 19 HIV-1 sequences mutated *in vitro* by hA3G or hA3F [7].