Viral variants were selected for experimental testing from a set of 94 V3 sequences of therapy-naïve patients. Viruses from blood samples of the Bonn hemophiliac HIV cohort were cultivated in PBMCs for one month and their NSI/SI phenotype was determined by light microscopy as described previously. Viral RNA was isolated and sequenced as described previously. The obtained V3 loop sequences were characterized with respect to their location in V3 sequence space, NSI/SI phenotype and tropism predicted by three computational methods: geno2pheno[coreceptor], WebPSSM and the 11/25 rule [11, 12].
Plasmids and cell lines
The V3 loop coding sequence was amplified by PCR either from the RT-PCR products of patient samples or from plasmids comprising env sequences of lab-adapted strains. Primers used for amplification introduced PvuII (5’: 21 bp upstream of V3 loop sequence) and XbaI (3’: 6 bp downstream of V3 loop sequence) restriction sites, respectively. For some of the constructs, synthetic gene fragments encoding V3 loops with the respective flanking sites were used (GeneArt AG, Germany). Fragments were subcloned into pCAGGS.NL4-3-Xba, an EnvNL4-3 expression construct derived from pCAGGS.NL4-3 (kindly provided by S. Pöhlmann) harbouring unique XbaI and PvuII restriction sites flanking the V3 loop encoding sequence. The silent mutation generating an XbaI site was introduced by overlap PCR using primers introducing an A to T and a G to C conversion at positions 994 and 995 of the env coding sequence, respectively. To yield the respective pCHIV derivatives, a StuI/XhoI fragment (comprising part of env including the V3 loop) was transferred into pCHIVΔStu, a derivative of pCHIV carrying a unique StuI site within the env gene. pCHIV derivatives encode a non-infectious proviral NL4-3-based clone which expresses all viral proteins except the accessory protein Nef. A variant of pCHIV harbouring a frameshift mutation in the beginning of the env gene (pCHIV.Env-) was used to produce isogenic viruses lacking the Env protein on their surface (Env-).
The coding sequence of an optimized version of a β-lactamase was amplified from pJH_SSbla_BlaFL_UDG_S1-1 by PCR with primers introducing KpnI and EcoRI restriction sites. The resulting fragment was subcloned into pMM310 to yield pBlaM.opt, encoding a fusion protein including the HIV-1 accessory protein Vpr and β-lactamase, separated by a recognition site for the HIV-1 protease.
293T cells were grown in Dulbecco’s modified Eagle’s medium (DMEM GlutaMAX; Invitrogen) supplemented with 100 U/ml penicillin, 100 μg/ml streptomycin and 10% fetal calf serum (FCS) at 37°C, 5% CO2. SupT1/CCR5 cells were kept in RPMI1640 GlutaMAX™ supplemented with 100 U/ml penicillin, 100 μg/ml streptomycin and 10% FCS. Transfections were carried out using polyethyleneimine according to standard procedures.
β-lactamase virion fusion assay
Entry efficiency was determined by a previously described HIV fusion assay. Briefly, viral reporter particles were prepared from 293T cells co-transfected with the respective pCHIV derivative and pBlaM.opt (plasmid ratio 15:1). At 44 h post transfection, tissue culture supernatants were precleared by filtration through a 0.45 μm nitrocellulose filter and virions were purified by ultracentrifugation through a 20% (w/v) sucrose cushion. Proteins from pelleted particles were separated by SDS-PAGE (acrylamide:bisacrylamide 200:1, 17.5% acrylamide) and transferred to a nitrocellulose membrane by semi-dry blotting. Membranes were probed with polyclonal antisera raised against HIV-1 CA. Bound antibodies were detected by quantitative immunoblot using a LiCor Odyssey system and particle concentration was determined by comparison to purified CA protein analyzed in parallel. Adjusted amounts of virus that yielded about 30% infection (as determined by titration experiments for each virus batch) were used to infect 1 × 10^6 SupT1/CCR5 cells seeded in V-bottom 96-well plates. For experiments including coreceptor antagonists, cells were preincubated with varying concentrations of drugs for 1 h at 37°C before virus was added. Following incubation with virus at 37°C for 6 h, supernatant was removed and cells were incubated with the β-lactamase cleavable dye CCF2-AM (GeneBLAZER, Invitrogen) in staining medium according to the manufacturer’s instructions for 16 h at room temperature. Cells were fixed with 3% PFA/PBS for 1 h at room temperature and stained for receptor and coreceptor surface levels with monoclonal antibodies directly coupled to three different fluorophores (αCD4-APC-H7, clone RPT-4, αCD184-APC, clone 12G5 and αCD195-PE, clone 2D7/CCR5; BD Biosciences, Germany) for 1 h at room temperature. 100,000 cells per sample were analysed by flow cytometry on a BD FACS CantoII machine using FACSDiva Software.
FCS2.0 files including the appropriate values for compensation (as determined by single-stained controls) were exported and subjected to computational analysis. To establish the entry positive gate, mock infections (no virus) were carried out in parallel for each experimental condition. In addition, viruses lacking the Env protein on their surface (Env-) were used in parallel to control for background signal independent of Env-receptor interactions.
In an initial step of data preprocessing, cell populations were filtered according to the FSC and SSC parameters (gating) to identify the major cell population and to filter out cells not belonging to it, as these are potentially defunct or of a different cell type. In order to address the shortcomings of the classical manual gating procedure and to obtain reproducible results efficiently, we developed an automated gating procedure. Cell populations are commonly represented by 2D scatter plots with FSC along the x-axis and SSC along the y-axis. In the automated gating procedure, cells were first filtered through a user-defined square window defined on the FSC and SSC values. Here, we used a filter of 300 < FSC < 1200 and 0 < SSC < 900 based on visual inspection. Next, a 2D grid of FSC and SSC values was defined and the number of cells in each bin of this grid was calculated, which can be presented in the form of a heat map (Additional file 1: Figure S4). Each bin of the grid had width and height 5, a value that was chosen among several others based on its gating agreement with the manual method. Next, for each of the values $x_i$ on the x-axis (FSC), a grid bin position $(x_i, y_i)$ was found that contained the maximum number of cells among all bins at the given $x_i$. For each such bin position $(x_i, y_i)$, the minimum distance $d_i$ was determined such that no cells were found in the bins at the equidistant positions $(x_i, y_i - d_i)$ and $(x_i, y_i + d_i)$; these positions were termed surrounding points. Next, to produce a smooth gating line, the y-coordinates of the surrounding points were averaged: each y-coordinate was replaced by the mean value of itself and those of the two neighbouring surrounding points at $x_{i-1}$ and $x_{i+1}$.
The same procedure was repeated along the y-axis. A gate was defined as the minimal contiguous area on the FSC-SSC grid encircled by the line connecting consecutive surrounding points. Example results of the automated gating procedure are shown in Additional file 1: Figure S4. The mock infection measurement (no virus) was used as the control for establishing the gate. All measurements in the same experiment were gated accordingly. The R package prada, a part of Bioconductor, was used for reading the FCS2.0 files.
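The column-wise part of the gating procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation (which was written in R); the function names, the use of bin indices instead of raw FSC/SSC values, and the handling of edge bins in the smoothing step are assumptions made for illustration only.

```python
from collections import Counter

def bin_cells(cells, fsc_range=(300, 1200), ssc_range=(0, 900), bin_size=5):
    """Count cells per (FSC bin, SSC bin) inside the pre-filter window."""
    counts = Counter()
    for fsc, ssc in cells:
        if fsc_range[0] < fsc < fsc_range[1] and ssc_range[0] < ssc < ssc_range[1]:
            counts[(int(fsc // bin_size), int(ssc // bin_size))] += 1
    return counts

def surrounding_points(counts):
    """For each FSC column return (column, lower SSC bin, upper SSC bin)."""
    columns = {}
    for (ix, iy), n in counts.items():
        columns.setdefault(ix, {})[iy] = n
    points = []
    for ix in sorted(columns):
        col = columns[ix]
        peak = max(col, key=col.get)  # densest SSC bin in this column
        d = 1
        # Grow d until the bins at peak - d and peak + d are both empty.
        while col.get(peak - d, 0) > 0 or col.get(peak + d, 0) > 0:
            d += 1
        points.append((ix, peak - d, peak + d))
    return points

def smooth(values):
    """Replace each value by the mean of itself and its available neighbours."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out
```

Applying `smooth` separately to the lower and upper SSC bins returned by `surrounding_points` gives a smoothed gating line along the FSC axis; the same functions can be reused with the coordinates swapped for the pass along the SSC axis.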
A classical approach to BlaM-based entry classification entails a manually drawn decision boundary based on the “no virus” control. The decision boundary is determined from the plot of the intensity of the blue (x-axis) against the green (y-axis) CCF2 signal of the cells in the control measurement. It delineates the region of uninfected cells as a minimal region such that only ~0.01% of the control cells are located outside of it. Measurements of cells incubated with virus variants are classified according to this control-based decision boundary.
In order to efficiently classify the large number of flow cytometry measurements and to obtain reproducible results, we established an automated method of BlaM classification. Our approach is based on a linear function $y = ax + b$ fitted to the blue (x) against green (y) CCF2 signal intensities of the cells of two merged controls: no virus and unstained. For each point $p$ on the fitted line, the data point $q$ was found that is the most distant from $p$ among the data points located on the line perpendicular to the fitted line that intersects the fitted line at $p$. The points $q$ represent cells showing the highest shift in the blue signal relative to the green signal among the cells of the control measurements. Next, the distances of these points from the fitted line were smoothed using a sliding window approach by averaging the values within each window and adding one standard deviation of the values within that window. The added standard deviation represents a margin beyond the control cells that ensures the required low proportion of false positives in the control measurement (~0.01%). A window size of 30 was selected as the size resulting in the best classification performance. The smoothed points projected back onto the plot of green and blue signals defined the cut-off decision line: cells represented by data points located below the decision line were classified as entry positive, those located above the line were classified as entry negative (Additional file 1: Figure S5). The method design and the choice of parameters were guided by the comparison with manual classification with the goal of achieving the highest agreement on a large number of measurements (Additional file 1: Figure S6). Classification based on this procedure, termed here binary classification, assigns a binary value to each cell, with 0 representing entry negative and 1 entry positive cells. To compensate for differences between the automated binary and manual classification, we additionally developed an alternative margin classification (Supplementary Information; Additional file 1: Figure S7). However, use of this classification method did not affect the accuracy of regression models of the virus entry efficiency, and binary classification was chosen as the less complex approach throughout this study. An example of classification of a virus measurement using both methods is depicted in Additional file 1: Figure S7. The steps of the classification procedure are described and illustrated in detail in the Supplementary Information.
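The following Python sketch illustrates one possible reading of the automated classification: a least-squares line through the control cells, the largest perpendicular distance towards the blue side at discrete steps along that line, and a sliding-window threshold of mean plus one standard deviation. The discretisation into `n_steps` steps, the function names and the use of the population standard deviation are assumptions for illustration, not details taken from the study.

```python
import math
from statistics import mean, pstdev

def fit_line(points):
    """Ordinary least-squares fit of green = a * blue + b."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def project(point, a, b):
    """Return (position along the line, signed perpendicular distance).
    A positive distance means the point lies below the line (shifted to blue)."""
    x, y = point
    norm = math.sqrt(1 + a * a)
    return (x + a * (y - b)) / norm, (a * x + b - y) / norm

def max_distance_profile(control, a, b, n_steps=100):
    """Largest blue-side distance of control cells at each step along the line."""
    proj = [project(p, a, b) for p in control]
    lo = min(t for t, _ in proj)
    hi = max(t for t, _ in proj)
    step = (hi - lo) / n_steps or 1.0
    profile = [0.0] * n_steps
    for t, d in proj:
        i = min(int((t - lo) / step), n_steps - 1)
        profile[i] = max(profile[i], d)
    return lo, step, profile

def sliding_threshold(distances, window=30):
    """Mean plus one standard deviation of the distances in each window."""
    half = window // 2
    out = []
    for i in range(len(distances)):
        w = distances[max(0, i - half): i + half + 1]
        out.append(mean(w) + pstdev(w))
    return out

def classify(cell, a, b, lo, step, threshold):
    """Return 1 (entry positive) if the cell lies beyond the decision line."""
    t, d = project(cell, a, b)
    i = min(max(int((t - lo) / step), 0), len(threshold) - 1)
    return 1 if d > threshold[i] else 0
```

In use, `fit_line` and `max_distance_profile` would be run once on the merged control cells, and `classify` would then be applied to every cell of a virus measurement.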
The experimental results were depicted as 3D plots of virus entry efficiency in dependence on two chosen parameters. Colours of the plots represent the predicted phenotype of a variant – red for X4, blue for R5, magenta for variants of questionable tropism. Individual cells were localized in a 30x30 grid of values spanning the ranges of values of the two parameters. Virus entry efficiency was calculated as the fraction of infected cells (assigned using binary classification) within each bin of the grid. Prior to plotting, the grid was smoothed by averaging values from neighbouring bins of the grid. In order to account for the differing numbers of cells that show a given combination of parameter values, parts of the grid that contained less than a selected minimum number of cells are coloured in gray. The selected minimum number of cells is 10% of the expected number of cells assuming an even distribution of cells over the grid.
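As a minimal sketch of the grid computation described above, the function below bins cells by two parameters, computes the fraction of entry-positive cells per bin, and marks sparsely populated bins. The input layout (tuples of two parameter values and a 0/1 entry call) and the use of `None` for bins that would be drawn in gray are assumptions; the neighbour smoothing step is omitted because its exact form is not specified.

```python
def entry_grid(cells, n_bins=30, min_fraction=0.10):
    """cells: list of (param1, param2, entry) with entry in {0, 1}.
    Returns grid[i][j] = fraction of entry-positive cells in the bin,
    or None where the bin holds fewer cells than the minimum count."""
    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        return min(int((v - lo) / (hi - lo) * n_bins), n_bins - 1)

    xs = [c[0] for c in cells]
    ys = [c[1] for c in cells]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    total = [[0] * n_bins for _ in range(n_bins)]
    positive = [[0] * n_bins for _ in range(n_bins)]
    for x, y, entry in cells:
        i, j = bin_index(x, x_lo, x_hi), bin_index(y, y_lo, y_hi)
        total[i][j] += 1
        positive[i][j] += entry
    # Minimum count: a fraction of the count expected under an even spread.
    min_cells = min_fraction * len(cells) / n_bins ** 2
    grid = []
    for i in range(n_bins):
        row = []
        for j in range(n_bins):
            n = total[i][j]
            row.append(positive[i][j] / n if n > 0 and n >= min_cells else None)
        grid.append(row)
    return grid
```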
Merging the data
To compensate for potential noise in the measurement at the single-cell level, the data was aggregated into a multidimensional grid defined on aggregated values of CD4, CCR5 and CXCR4 and on all measured drug concentration levels. CD4, CCR5 and CXCR4 expression levels were first scaled to standard normal distribution and cells in the top and bottom 2.5% tails of the distribution were removed. Next, a five-dimensional grid was defined – spanning all tested concentration levels of AMD and MVC and a predefined number of values of the CD4, CCR5 and CXCR4 expression levels, separating their range of expression into bins of equal size. We tested four bin sizes – 5, 10, 20, 50 – for the quality of the resulting models. Values of binary classification of individual cells were averaged in each grid bin.
Data grids of cell entry measurements of each tested virus constructed in this way were merged across the individual experiments resulting in a multidimensional data grid describing each virus’ cell entry efficiency comprehensively across all experiments.
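The aggregation step for a single measurement can be sketched as follows. This is an illustrative reading only: the dictionary keys, the interpretation of the tested "bin sizes" as a number of bins per marker (`n_bins`), and the use of empirical quantiles for trimming the 2.5% tails are assumptions, and the sketch presumes that each marker varies across cells.

```python
from collections import defaultdict
from statistics import mean, pstdev

MARKERS = ("cd4", "ccr5", "cxcr4")

def zscore(values):
    """Scale values to zero mean and unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def aggregate(cells, n_bins=10, tail=0.025):
    """cells: list of dicts with keys cd4, ccr5, cxcr4, amd, mvc, entry.
    Returns {(amd, mvc, cd4_bin, ccr5_bin, cxcr4_bin): mean entry call}."""
    z = {m: zscore([c[m] for c in cells]) for m in MARKERS}
    keep = set(range(len(cells)))
    for m in MARKERS:
        ranked = sorted(z[m])
        k = int(tail * len(ranked))
        lo, hi = ranked[k], ranked[len(ranked) - 1 - k]
        # Drop cells in the lower or upper tail of this marker.
        keep &= {i for i, v in enumerate(z[m]) if lo <= v <= hi}
    kept = sorted(keep)
    lows = {m: min(z[m][i] for i in kept) for m in MARKERS}
    widths = {m: (max(z[m][i] for i in kept) - lows[m]) / n_bins or 1.0
              for m in MARKERS}
    sums = defaultdict(lambda: [0, 0])
    for i in kept:
        key = [cells[i]["amd"], cells[i]["mvc"]]
        for m in MARKERS:
            key.append(min(int((z[m][i] - lows[m]) / widths[m]), n_bins - 1))
        acc = sums[tuple(key)]
        acc[0] += cells[i]["entry"]
        acc[1] += 1
    return {key: s / n for key, (s, n) in sums.items()}
```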
The selection criteria for model quality comprised two aspects: accuracy of model fit to the data and separation of the X4 and R5 phenotype vectors. The first criterion, accuracy of model fit to the data, was based on the $R^2$ measure estimated as:
$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}$$
where $SS_{\mathrm{res}} = \sum_i (y_i - \hat{y}_i)^2$ is the sum of squared residuals with $y_i$ being the observed and $\hat{y}_i$ the estimated output, and $SS_{\mathrm{tot}} = \sum_i (y_i - \bar{y})^2$ is the total sum of squares, proportional to the sample variance, with $\bar{y}$ being the sample mean. $R^2$ was used as a measure of agreement between the observed and modelled values with higher values representing a better agreement.
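As a minimal sketch, the coefficient of determination can be computed as one minus the ratio of the residual to the total sum of squares. The function name and argument layout are illustrative.

```python
def r_squared(observed, estimated):
    """R^2 = 1 - SS_res / SS_tot for paired observed and estimated values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, estimated))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot
```

A model that reproduces the observations exactly yields 1, while a model that always predicts the sample mean yields 0.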
The second criterion – separation of the R5 and X4 phenotype vector – was used to obtain models capable of distinguishing between the two contrasting phenotypes. To measure the separation of the R5 and X4 models we used the Euclidean vector distance of the coefficient vectors of the two models. The model was selected for which the phenotype vectors of the R5 and X4 reference strains are most distant.
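The separation criterion can be sketched as follows, assuming each candidate model setting provides one coefficient vector for the R5 reference strain and one for the X4 reference strain. The dictionary layout and the function name `select_model` are hypothetical.

```python
import math

def select_model(candidates):
    """candidates: {name: (r5_coefficients, x4_coefficients)}.
    Returns the name whose two coefficient vectors are farthest apart
    in Euclidean distance."""
    return max(candidates, key=lambda name: math.dist(*candidates[name]))
```

In the study this criterion was combined with the fit accuracy; the sketch covers the distance comparison only.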
Prediction of phenotype vectors
For prediction we used binary sequence encoding in which each amino acid is represented by a binary vector of length 20 with a single value 1 at the position indicating the present amino acid. The V3 loops of the 23 tested variants in this study include 88 positions that vary among the variants. We used two methods that involve shrinkage procedures for linear regression: Ridge regression and Lasso. These methods were trained on the binary sequence encoding of the V3 sequences of the viruses with the respective phenotype vectors as output variables. For each position of the phenotype vector, representing a separate output variable for the prediction method, the penalty parameter $\lambda$ resulting in minimal prediction error in leave-one-out cross-validation (LOOCV) was chosen from a sequence of 100 values. The chosen $\lambda$ values were used in further phenotype prediction. In addition to Ridge regression and Lasso, we tested the performance of linear regression based on a reduced number of input variables. The input variables were reduced to those showing significant (p < 0.01) Pearson correlation with any of the output variables (positions of the phenotype vector). Significance was calculated in 1000 permutation tests. The reduction procedure resulted in 26 and 17 input variables in the logarithmic and linear models of phenotype vectors, respectively.
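The encoding and the penalty selection can be illustrated with a small, dependency-free Python sketch. It covers Ridge regression only (not Lasso or the correlation-based variable reduction), fits no intercept, and uses a hand-written linear solver; all function names and the amino acid ordering are assumptions for illustration rather than the software used in the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues

def one_hot(seq):
    """Binary encoding: 20 values per position, a single 1 for the residue."""
    vec = []
    for aa in seq:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec

def varying_columns(rows):
    """Indices of encoded positions that differ among the sequences."""
    return [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]

def solve(A, b):
    """Gaussian elimination with partial pivoting for A x = b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: (X'X + lam * I) w = X'y, no intercept."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve(A, b)

def loocv_error(X, y, lam):
    """Mean squared leave-one-out prediction error for a given penalty."""
    err = 0.0
    for i in range(len(X)):
        w = ridge_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        pred = sum(wj * xj for wj, xj in zip(w, X[i]))
        err += (pred - y[i]) ** 2
    return err / len(X)

def choose_lambda(X, y, lambdas):
    """Penalty with the smallest LOOCV error among the candidate values."""
    return min(lambdas, key=lambda lam: loocv_error(X, y, lam))
```

Running `choose_lambda` once per position of the phenotype vector mirrors the per-output penalty selection described above; with a strictly positive penalty the linear system is always solvable.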
Optimal training set size
Each variant’s phenotype was predicted based on a sampled training set of varying size, increasing from 2 to 22. Training sets of each size were sampled multiple times; the prediction error of each clone was averaged for each size of the training set. Next, a polynomial function, termed the error function, was fitted to the relationship between the size of the training set and the prediction error. See Additional file 1: Figure S11 for examples of error functions.
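The sampling part of this analysis can be sketched as below. The predictor is passed in as a function, the error is taken as the absolute difference of scalar outputs, and the number of repeats is arbitrary; these are assumptions for illustration, and the subsequent polynomial fit of error against training set size is not shown.

```python
import random

def learning_curve(samples, fit_predict, sizes, n_repeats=20, seed=0):
    """samples: list of (x, y) pairs; fit_predict(train, x) -> predicted y.
    For each training set size, every sample is predicted from training
    sets drawn at random from the remaining samples.
    Returns {size: mean absolute prediction error}."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        errors = []
        for idx, (x, y) in enumerate(samples):
            others = samples[:idx] + samples[idx + 1:]
            for _ in range(n_repeats):
                train = rng.sample(others, size)
                errors.append(abs(fit_predict(train, x) - y))
        curve[size] = sum(errors) / len(errors)
    return curve
```

With 23 variants, `sizes=range(2, 23)` reproduces the range of training set sizes given in the text, since each variant is predicted from at most the 22 others.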