# Factor correction as a tool to eliminate between-session variation in replicate experiments: application to molecular biology and retrovirology

- Jan M Ruijter^{1} (corresponding author)
- Helene H Thygesen^{2}
- Onard JLM Schoneveld^{3, 4}
- Atze T Das^{5}
- Ben Berkhout^{5}
- Wouter H Lamers^{3, 1}

**3**:2

**DOI:** 10.1186/1742-4690-3-2

© Ruijter et al; licensee BioMed Central Ltd. 2006

**Received:** 21 December 2005

**Accepted:** 06 January 2006

**Published:** 06 January 2006

## Abstract

### Background

In experimental biology, including retrovirology and molecular biology, replicate measurement sessions very often show similar proportional differences between experimental conditions, but different absolute values, even though the measurements were presumably carried out under identical circumstances. Although statistical programs enable the analysis of condition effects despite this replication error, this approach is hardly ever used for this purpose. On the contrary, most researchers deal with such between-session variation by normalisation or standardisation of the data. In normalisation all values in a session are divided by the observed value of the 'control' condition, whereas in standardisation, the sessions' means and standard deviations are used to correct the data. Normalisation, however, adds variation because the control value is not without error, while standardisation is biased if the data set is incomplete.

### Results

In most cases, between-session variation is multiplicative and can, therefore, be removed by dividing the data in each session by a session-specific correction factor. Assuming one level of multiplicative between-session error, unbiased session factors can be calculated from all available data through the generation of a between-session ratio matrix. Alternatively, these factors can be estimated with a maximum likelihood approach. The effectiveness of this correction method, dubbed "factor correction", is demonstrated with examples from the fields of molecular biology and retrovirology. Especially when not all conditions are included in every measurement session, factor correction results in a smaller residual error than normalisation and standardisation and therefore allows the detection of smaller treatment differences. Factor correction was implemented in an easy-to-use computer program that is available on request from biolab-services@amc.uva.nl (subject: factor).

### Conclusion

Factor correction is an effective and efficient way to deal with between-session variation in multi-session experiments.

## Background

The most popular methods used to remove between-session variation in biomedical research are "normalisation" and "standardisation" [5]. In normalisation, a "control" condition is defined and, per session, all measured values (Y_{ni}) are divided by the control value in that session (Eq. 1, with session n, condition i and control condition 1).

$$Y^{\mathrm{norm}}_{ni} = \frac{100 \times Y_{ni}}{Y_{n1}} \qquad \text{(Eq. 1)}$$

Thus, a single control condition is chosen to serve as a correction factor (100/Y_{n1} in Eq. 1). Figure 1B shows the data from Figure 1A when normalisation, using DNA construct 1 as control, is applied. Since DNA construct 1 was lost in one session (◆), normalisation led to the loss of this entire session. Normalisation does remove some between-session variation, but because the control condition itself carries biological error, it can also increase the variation. The variation for constructs 6 and 8, for example, is much larger after normalisation than in the original data (compare Figs. 2A and 2B). Another drawback of normalisation is that it generates a control condition without variation. Since parametric statistical tests for the comparison of two or more conditions assume an equal variance in all conditions [6], these tests can no longer be used. Most nonparametric tests are also no longer applicable, because they require similar distributions in all conditions [7].
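The behaviour described above can be illustrated with a minimal Python sketch. The function name `normalise`, the nested-dictionary layout and all numbers are assumptions made for illustration; they are not taken from the data sets in this paper.

```python
def normalise(data, control):
    """Express every value as a percentage of the session's control value.

    data: {session: {condition: value}}.
    Sessions that lack the control condition cannot be normalised and are dropped.
    """
    normalised = {}
    for session, values in data.items():
        if control not in values:
            continue  # no control value, so the whole session is lost
        scale = 100.0 / values[control]
        normalised[session] = {cond: y * scale for cond, y in values.items()}
    return normalised


# Hypothetical measurements: session "s3" lost its control condition "c1".
data = {
    "s1": {"c1": 2.0, "c2": 4.1, "c3": 7.8},
    "s2": {"c1": 11.0, "c2": 19.0, "c3": 42.0},
    "s3": {"c2": 0.9, "c3": 1.7},
}
result = normalise(data, "c1")
```

In this sketch the control condition ends up at exactly 100 in every retained session, so it no longer carries any variance, and session "s3" disappears from the result altogether.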

In standardisation [5], each value per session is transformed into a standard value by subtracting the session mean (Ȳ_{n}) and dividing the result by the session standard deviation (SD_{n}, Eq. 2).

$$Y^{\mathrm{std}}_{ni} = \frac{Y_{ni} - \bar{Y}_{n}}{SD_{n}} \qquad \text{(Eq. 2)}$$

Because the session mean after standardisation becomes zero for each session, standardisation removes between-session variation (Fig. 1C). However, the original measurement scale is lost and the overall mean becomes zero. Furthermore, if not all conditions are present in every session, the session mean and standard deviation will be biased. Because the standard deviation serves as a multiplicative correction factor, this bias can result in added variability between sessions (as observed for the sessions indicated with triangles and filled diamonds in Fig. 1C). Standardisation can, therefore, only be used effectively when the data set is complete, that is, when all conditions are present in every session.
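The bias that arises with incomplete sessions can be made concrete with a small Python sketch. The helper `standardise`, the condition pattern and the session factors are hypothetical, and the sample standard deviation is assumed as the session SD.

```python
from statistics import mean, stdev


def standardise(data):
    """Convert each value to (value - session mean) / session SD."""
    standardised = {}
    for session, values in data.items():
        ys = list(values.values())
        m = mean(ys)
        sd = stdev(ys)  # sample SD of the values present in this session
        standardised[session] = {c: (y - m) / sd for c, y in values.items()}
    return standardised


# Hypothetical data: the same condition pattern in every session.
# Session "s2" is 10 times higher; session "s3" misses the highest condition "c4".
pattern = {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 6.0}
data = {
    "s1": dict(pattern),
    "s2": {c: 10.0 * y for c, y in pattern.items()},
    "s3": {c: y for c, y in pattern.items() if c != "c4"},
}
z = standardise(data)
```

The two complete sessions give identical standard values despite the tenfold difference in scale. In the incomplete session, however, both the mean and the SD shift, so condition "c3" receives a standard value of 1 instead of 0 although its underlying value is unchanged.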

As mentioned above, the between-session variation is due to multiplicative session factors. When known, these factors can be used to correct the data. As was demonstrated in the previous paragraphs, normalisation and standardisation both use correction factors that can lead to ineffective correction or even to increased variation within conditions. For a correction method to be effective, the correction factors should be based on all available observations in the session, and the estimation of these factors should not be affected by incomplete data sets. This paper describes such a correction method, dubbed "factor correction", and introduces two approaches to estimate correction factors. In the first, "ratio" approach, the variation in the data set is assumed to be restricted to the condition effects, whereas in the second, "maximum likelihood" approach, part of the variation may result from variation among the factors affecting the individual measurements in each session. Both approaches turn out to result in very similar correction factors. Their use and effectiveness are illustrated using data sets from molecular biology and retrovirology.

## Results

### Mixed additive and multiplicative model

In the molecular-biology data set plotted in Figure 1A, the different DNA constructs represent the experimental conditions. Data from transfection experiments carried out on different days are the measurement sessions. The multiplicative nature of the between-session variation in this data set is apparent from the fact that the lines connecting the data points in each session run approximately parallel in a logarithmic plot of the data (Fig. 1A). In a multi-session experiment with such a multiplicative between-session variation, the observations can be described with a mixed additive and multiplicative model (Eq. 3).

$$Y_{ni} = F_{n} \times (Y_{\mathrm{mean}} + E_{i} + \mathrm{error}_{ni}) \qquad \text{(Eq. 3)}$$

The additive part of this model, between parentheses, states that the result of a measurement Y in condition i is the sum of the population mean (Y_{mean}), the effect of condition i (E_{i}), and an experimental error. Note that 'effect' in the sense used here does not represent the difference between a control and an experimental condition, but stands for the effect of each condition relative to the population mean. Therefore, the sum of the condition effects is 0 (Σ_{i} E_{i} = 0). In this model the biological error is normally distributed with mean 0 and standard deviation *σ*. This biological error reflects the variance *within* a condition, whereas the condition effects reflect the differences *between* conditions [6]. For each session n, the additive part of the observation is multiplied by session factor F_{n}. The product of the session factors equals 1 (Π_{n} F_{n} = 1), which ensures that the mean of Y_{ni} is still equal to the overall Y_{mean}.
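To make the model of Eq. 3 tangible, the following Python sketch draws observations from it. The function `simulate` and all parameter values are illustrative assumptions; they only respect the two constraints stated above (effects sum to 0, factors multiply to 1).

```python
import math
import random


def simulate(factors, effects, y_mean, sigma, seed=0):
    """Draw one observation per session and condition from the model of Eq. 3."""
    rng = random.Random(seed)
    return {
        n: {i: f * (y_mean + e + rng.gauss(0.0, sigma)) for i, e in effects.items()}
        for n, f in factors.items()
    }


# Illustrative parameters (not taken from the paper's data sets).
factors = {"s1": 0.5, "s2": 1.0, "s3": 2.0}   # product equals 1
effects = {"A": -50.0, "B": 0.0, "C": 50.0}   # sum equals 0
noise_free = simulate(factors, effects, y_mean=100.0, sigma=0.0)

# Without error, the log-distance between two sessions is identical for
# every condition, which is why session lines run parallel in a log plot.
log_gaps = [math.log(noise_free["s3"][i] / noise_free["s1"][i]) for i in effects]
```

With `sigma` set to a positive value the session lines are no longer exactly parallel, and the remaining deviations correspond to the error term of the model.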

The session factors can be estimated from all available data in the multi-session data set with two different approaches: calculation of a between-session ratio matrix (Ratio approach) or a maximum likelihood approach.

### Estimation of the session factors with the Ratio approach

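To estimate the session factors with the Ratio approach, a between-session ratio is calculated for each pair of sessions (Eq. 4). For sessions 5 and 6 and condition i, for example, this ratio is:

$$\frac{Y_{5i}}{Y_{6i}} = \frac{F_{5}}{F_{6}} \times \frac{Y_{\mathrm{mean}} + E_{i} + \mathrm{error}_{5i}}{Y_{\mathrm{mean}} + E_{i} + \mathrm{error}_{6i}} \qquad \text{(Eq. 4)}$$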

In such a between-session ratio, the normally distributed additive parts of the multi-session model (Y_{mean} + E_{i} + error_{ni}) have the same mean and standard deviation, and hence a ratio of 1. The error of such a ratio of normally distributed variables has a Cauchy distribution [8], which implies that, strictly speaking, its mean does not exist. However, the Cauchy distribution has a symmetrical bell shape centred on zero, has a median of zero [8] and, with a more general definition of integration, its mean can also be considered to be zero [9]. Therefore, on average, the error in the last term of Eq. 4 is zero and the term cancels out, which makes the between-session ratio an unbiased estimate of the ratio of two session factors. When two sessions have more than one condition in common, a between-session ratio is calculated for each matching pair of conditions. Because we are dealing with multiplicative effects, the geometric mean of these ratios [10] is used in the between-session ratio matrix.
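The construction of the between-session ratio matrix can be sketched in Python as follows. The names `ratio_matrix` and `geometric_mean`, the nested-dictionary layout and the numbers are assumptions for illustration; the sketch also assumes strictly positive measurements, because logarithms are taken.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ratio_matrix(data):
    """R[j][n] estimates F_j / F_n from the conditions that sessions j and n share.

    data: {session: {condition: value}}.
    Pairs of sessions without a shared condition are marked with None.
    """
    sessions = list(data)
    R = {j: {} for j in sessions}
    for j in sessions:
        for n in sessions:
            shared = set(data[j]) & set(data[n])
            if shared:
                R[j][n] = geometric_mean([data[j][c] / data[n][c] for c in shared])
            else:
                R[j][n] = None
    return R


# Hypothetical incomplete data set: "s1" and "s3" share no condition.
data = {
    "s1": {"A": 5.0, "B": 10.0},
    "s2": {"A": 10.0, "B": 20.0, "C": 30.0},
    "s3": {"C": 60.0},
}
R = ratio_matrix(data)
```

Here `R["s2"]["s1"]` is the geometric mean of the ratios for conditions A and B, while `R["s1"]["s3"]` stays undefined because the two sessions have no condition in common.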

In the example data set (Fig. 1A), sessions 1 and 6 have no conditions in common and, therefore, a between-session ratio cannot be calculated directly for this pair of sessions. To be able to calculate proper session factors without the loss of data sets like sessions 1 and 6, missing between-session ratios have to be substituted. It is possible to calculate a substitute for a missing ratio in column j and row i (R_{j/i}) from a known ratio in that column (e.g. R_{j/n}) and two other ratios from these two rows in another column (R_{k/i} and R_{k/n}). A substitute value for the missing ratio R_{j/i} is then calculated as R_{j/i} = R_{j/n} × R_{k/i}/R_{k/n}. If such a substitute is computed for all possible R_{j/n}, R_{k/i}, and R_{k/n}, the geometric mean of all values will be the best estimate of the missing ratio R_{j/i}.
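A minimal Python sketch of this substitution step is given below. The function `fill_missing` and the example matrix are hypothetical; the matrix is indexed as `R[j][i]` for the ratio R_{j/i}, and missing entries are represented by `None`.

```python
import math


def fill_missing(R):
    """Return a copy of R in which each missing R[j][i] is replaced by the
    geometric mean of R[j][n] * R[k][i] / R[k][n] over all usable k and n."""
    sessions = list(R)
    filled = {j: dict(R[j]) for j in sessions}
    for j in sessions:
        for i in sessions:
            if R[j][i] is not None:
                continue
            # Only the originally known ratios are used as building blocks.
            logs = [
                math.log(R[j][n] * R[k][i] / R[k][n])
                for n in sessions
                for k in sessions
                if None not in (R[j][n], R[k][i], R[k][n])
            ]
            if logs:
                filled[j][i] = math.exp(sum(logs) / len(logs))
    return filled


# Hypothetical ratio matrix for session factors 0.5, 1 and 2,
# in which the ratio between "s1" and "s3" could not be observed.
R = {
    "s1": {"s1": 1.0, "s2": 0.5, "s3": None},
    "s2": {"s1": 2.0, "s2": 1.0, "s3": 0.5},
    "s3": {"s1": None, "s2": 2.0, "s3": 1.0},
}
filled = fill_missing(R)
```

In this error-free example every available combination of k and n yields the same substitute, so the geometric mean simply returns the ratio of the underlying factors. With real data the individual substitutes differ and the geometric mean averages them.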

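Because the product of all session factors in the multi-session model equals 1, the geometric mean of column i in this between-session ratio matrix is an estimate of the correction factor for session i (Eq. 5, with N the number of sessions):

$$F_{i} = \left(\prod_{n=1}^{N} R_{i/n}\right)^{1/N} \qquad \text{(Eq. 5)}$$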

The between-session variation in the original data set can now be removed by dividing each measured value by the corresponding session factor (Eq. 6):

$$Y^{\mathrm{corr}}_{ni} = \frac{Y_{ni}}{F_{n}} \qquad \text{(Eq. 6)}$$

The corrected data are shown in Fig. 1D.
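The last two steps of the ratio approach, taking the geometric mean per column and dividing the data by the resulting factors, can be sketched in Python as follows. The functions `session_factors` and `factor_correct` and the numbers are illustrative assumptions, and the ratio matrix is assumed to be complete (missing ratios already substituted).

```python
import math


def session_factors(R):
    """Factor of session j = geometric mean of column j of a complete ratio matrix."""
    return {
        j: math.exp(sum(math.log(r) for r in column.values()) / len(column))
        for j, column in R.items()
    }


def factor_correct(data, factors):
    """Divide every measured value by the factor of its session."""
    return {
        n: {c: y / factors[n] for c, y in values.items()}
        for n, values in data.items()
    }


# Hypothetical error-free example with session factors 0.5, 1 and 2.
true_factors = {"s1": 0.5, "s2": 1.0, "s3": 2.0}
R = {j: {n: true_factors[j] / true_factors[n] for n in true_factors}
     for j in true_factors}
data = {
    "s1": {"A": 25.0, "B": 50.0},
    "s2": {"A": 50.0, "B": 100.0, "C": 150.0},
    "s3": {"C": 300.0},
}
factors = session_factors(R)
corrected = factor_correct(data, factors)
```

After correction, each condition has the same value in every session in which it was measured, and the values remain on the original measurement scale.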

### Estimation of session factors with the maximum likelihood approach

In the above mixed additive and multiplicative model the error term is normally distributed with a standard deviation *σ*. When we define F'_{n} = *σ*·F_{n} and Y'_{i} = Y_{i}/*σ*, with Y_{i} as the mean value per condition (Y_{i} = Y_{mean} + E_{i}; see Eq. 3), the model can be rewritten as Y_{ni} = F'_{n}(Y'_{i} + error_{ni}/*σ*), and Y_{ni}/F'_{n} − Y'_{i} = error_{ni}/*σ* can then be shown to be normally distributed with mean 0 and standard deviation 1. Based on this form of the model, the likelihood of the observed set of Y_{ni} is given by Eq. 7

$$L = \prod_{n,i} \frac{1}{F'_{n}\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{Y_{ni}}{F'_{n}} - Y'_{i}\right)^{2}\right) \qquad \text{(Eq. 7)}$$

which is the chance of finding each individual observation Y_{ni} given F_{n} and Y_{i}, multiplied (Π) for all observations.

When the likelihood reaches its maximum for the scaled condition means at Y_{i,max} and for the scaled session factors at F_{n,max}, then Y_{i,max} and F_{n,max} are found when the first derivatives in Y and F of the log of this likelihood function equal 0. The estimation equations for Y_{i} and F_{n} are not independent of each other and, therefore, an iterative procedure is required to estimate the sets of Y_{i,max} and F_{n,max} parameters.
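One possible iterative procedure is sketched below in Python. It is not the implementation used for this paper: the update formulas are derived here under the assumption that each observation is normally distributed with mean F'_{n}·Y'_{i} and standard deviation F'_{n}, where F'_{n} = σ·F_{n} and Y'_{i} = Y_{i}/σ (the primes are notation chosen for this sketch). Setting the derivative of the log-likelihood to zero then gives Y'_{i} as the mean of Y_{ni}/F'_{n}, and F'_{n} as the positive root of m·F'² + B·F' − C = 0, with m the number of observations in session n, B = Σ Y_{ni}·Y'_{i} and C = Σ Y_{ni}². The function name, the stopping rule and the simulated data are hypothetical.

```python
import math
import random


def ml_factors(data, tol=1e-12, max_iter=20000):
    """Alternately maximise the likelihood in Y'_i and F'_n (illustrative sketch).

    data: {session: {condition: [replicate values]}}
    Returns (session factors with product 1, condition means, sigma).
    """
    sessions = list(data)
    conditions = sorted({c for values in data.values() for c in values})
    f = {n: 1.0 for n in sessions}  # starting values for F'_n
    y = {}
    for _ in range(max_iter):
        # Derivative in Y'_i equals 0: Y'_i is the mean of Y_ni / F'_n.
        for c in conditions:
            scaled = [v / f[n] for n in sessions for v in data[n].get(c, [])]
            y[c] = sum(scaled) / len(scaled)
        # Derivative in F'_n equals 0: positive root of m F'^2 + B F' - C = 0.
        change = 0.0
        for n in sessions:
            obs = [(v, y[c]) for c, vs in data[n].items() for v in vs]
            m = len(obs)
            b = sum(v * yc for v, yc in obs)
            c2 = sum(v * v for v, _ in obs)
            new_f = 2.0 * c2 / (b + math.sqrt(b * b + 4.0 * m * c2))
            change = max(change, abs(new_f / f[n] - 1.0))
            f[n] = new_f
        if change < tol:
            break
    # Product of the F_n equals 1, so sigma is the geometric mean of the F'_n.
    sigma = math.exp(sum(math.log(v) for v in f.values()) / len(f))
    return ({n: f[n] / sigma for n in sessions},
            {c: y[c] * sigma for c in conditions},
            sigma)


# Simulated example with hypothetical values: 3 sessions, 3 conditions, 5 replicates.
rng = random.Random(1)
true_f = {"s1": 0.5, "s2": 1.0, "s3": 2.0}
true_y = {"A": 50.0, "B": 100.0, "C": 150.0}
data = {n: {c: [fn * (mu + rng.gauss(0.0, 10.0)) for _ in range(5)]
            for c, mu in true_y.items()}
        for n, fn in true_f.items()}
factors, means, sigma = ml_factors(data)
```

The relative factors settle within a few iterations, whereas the common scale (and thus the estimate of σ) converges slowly with this simple alternating scheme, which is why the sketch iterates until the factors stop changing. Standard errors, which the maximum likelihood approach also provides, are not computed here.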

This maximum likelihood approach results in a set of session factors (F_{n}) as well as estimates of condition means (Y_{i}). For both sets of parameters the maximum likelihood approach also estimates standard errors that can be used to compare factors and condition means among each other. Note that in this approach part of the variation in the data set is attributed to a variation in factor effect within a session. This is in contrast to the above ratio approach in which the factors are assumed to be fixed.

Results of the application of both methods for estimation of session factors on a simulated data set. A multi-session experiment with 5 sessions and 5 conditions was simulated with 5 observations per combination of session and condition. Each condition was measured in 4 different sessions. In simulating data, the overall mean was set to 100 and the standard deviation was set to 10. Factors and condition effects are given in the table. The estimated session factors are all close to the factors used in the simulation for both methods and the factors estimated with the ratio method are well within the variance of those estimated with the maximum likelihood approach. The condition means estimated with the maximum likelihood method are close to the values used in the simulation.

Simulation settings:

Ymean | sd | n | se
---|---|---|---
100 | 10 | 20 | 2.24

Session factors:

session | simulated factor | ratio: observed factor | max. likelih.: observed factor | max. likelih.: se
---|---|---|---|---
1 | 0.1 | 0.101 | 0.101 | 0.002
2 | 0.2 | 0.188 | 0.188 | 0.004
3 | 1 | 1.065 | 1.054 | 0.021
4 | 5 | 4.913 | 4.979 | 0.093
5 | 10 | 10.05 | 10.02 | 0.185

Condition means:

condition | simulated effect | observed mean | se
---|---|---|---
A | -50 | 51.7 | 2.14
B | -20 | 78.6 | 2.14
C | 0 | 101.7 | 2.15
D | 20 | 119.4 | 2.15
E | 50 | 151.4 | 2.16

### Application of factor correction to molecular-biology data set

The results of normalisation and standardisation of the incomplete data set from Figure 1A are shown in Figures 1B and 1C and were discussed above. The result of factor correction (ratio approach) is plotted in Figure 1D. The factors estimated by maximum likelihood result in a graph that is indistinguishable. The reduced distance between the session lines in Figure 1D, compared to Figure 1A, shows that the multiplicative between-session variation has been removed successfully. This is also shown by the reduced variation within the conditions after factor correction (compare Fig. 2A and Fig. 2D). The remaining difference between the session lines (Fig. 1D) reflects the non-multiplicative component of the variation, which represents the error component in the multi-session model (Eq. 3). Compared to normalisation (Figs. 1B and 2B) and standardisation (Figs. 1C and 2C), the within-condition variation after factor correction is clearly reduced, demonstrating that factor correction is more effective in the removal of between-session variation. When the factor-corrected data are used to test the differences between each of the DNA constructs and construct 1, only constructs 3 and 6 are not significantly different (t-test; P = 0.095 and P = 0.071, respectively; Fig. 2D). The same test applied to normalised and standardised data reveals that only 2 and 1 DNA constructs, respectively, differ significantly from construct 1 (asterisks in Figs. 2B and 2C). These results demonstrate that the power of the statistical comparison clearly increases after factor correction.

### Application of factor correction to retrovirology data set

## Discussion

This paper describes factor correction as an effective method to remove between-session variation from multi-session experiments. Using data sets from the fields of molecular biology and retrovirology, we demonstrate that factor correction effectively eliminates between-session variation in both complete and incomplete data sets. The corrected data set can be used reliably for statistical testing of differences between conditions, because the statistical error is not affected by factor correction. Moreover, the scale of the factor-corrected values can be considered to represent the original measurement scale.

Similar to normalisation and standardisation, factor correction is based on a multiplicative model for the variation observed in such multi-session experiments (Eq. 3). After normalisation, standardisation, and factor correction, the pattern of between-condition differences is very similar (Figs. 2 and 3). However, in normalisation, the control condition has lost its variance and the variance of all other conditions is larger than when factor correction is applied (cf. Figs. 2B and 2D, 3B and 3D). In other words, the variation that is lost in the control condition has been added to the other conditions. This is because the users of normalisation implicitly, but unjustifiably, assume that the control condition is error-free. Because the HIV-1 virology data set was complete, the standardised and factor-corrected data sets are very similar (cf. Figs. 3C and 3D). However, when standardisation is applied to an incomplete data set, neither the session mean nor the session standard deviation is corrected for missing conditions, which may increase the variation for some conditions. The variation that is observed for constructs 2 and 5 in the molecular-biology data set, for example, is clearly larger after standardisation than after factor correction (cf. Figs. 2C and 2D). In factor correction, all available data are equally weighted to estimate session factors, which allows its use for incomplete data sets.

An alternative method to estimate the multiplicative factors in the mixed additive and multiplicative model is the use of two-way ANOVA after a logarithmic transformation of the data which converts the multiplicative session factor into an additive component. The application of two-way ANOVA without interaction between session and condition then results in a log-factor per session. Note that the condition effects that result from this two-way ANOVA are calculated as multiplicative effects and this will cause the factor estimates to differ marginally from those calculated either with the ratio approach or by maximum likelihood estimation (data not shown).
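The log-transformation route can be sketched in Python without a statistics package. The sketch fits the additive model log Y_{ni} = a_{n} + b_{i} by alternately averaging residuals over conditions and sessions, which is assumed here as a simple stand-in for a two-way ANOVA without interaction; the function name, the number of sweeps and the data are hypothetical, and the design is assumed to be connected (every session linked to the others through shared conditions).

```python
import math


def log_anova_factors(data, sweeps=500):
    """Fit log Y_ni = a_n + b_i by alternating means.

    data: {session: {condition: positive value}}
    Returns exp(a_n), rescaled so that the product of the factors equals 1.
    """
    sessions = list(data)
    conditions = sorted({c for values in data.values() for c in values})
    logs = {n: {c: math.log(y) for c, y in data[n].items()} for n in sessions}
    a = {n: 0.0 for n in sessions}    # log session factors
    b = {c: 0.0 for c in conditions}  # log condition levels
    for _ in range(sweeps):
        for c in conditions:
            resid = [logs[n][c] - a[n] for n in sessions if c in logs[n]]
            b[c] = sum(resid) / len(resid)
        for n in sessions:
            resid = [lv - b[c] for c, lv in logs[n].items()]
            a[n] = sum(resid) / len(resid)
    centre = sum(a.values()) / len(a)
    return {n: math.exp(a[n] - centre) for n in sessions}


# Hypothetical error-free, incomplete data set with session factors 0.5, 1 and 2.
data = {
    "s1": {"A": 25.0, "B": 50.0},
    "s2": {"A": 50.0, "B": 100.0, "C": 150.0},
    "s3": {"B": 200.0, "C": 300.0},
}
log_factors = log_anova_factors(data)
```

For error-free data the recovered factors coincide with the true ones. With noisy data the estimates are expected to deviate slightly from those of the ratio and maximum likelihood approaches, because the error is then treated as multiplicative as well.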

The two methods to estimate session factors described in this paper give slightly different results because the maximum likelihood approach assigns part of the variation to the estimated session factors. The ratio approach can be seen as a special case, in which the user assumes that the multiplicative factor is the same for every measurement in a session. Therefore, the maximum likelihood method is the more generally applicable of the two methods. In this paper the equations for the maximum likelihood approach have been developed for a one-way experimental design. Because the focus of this paper is to present an alternative to the unsound normalisation often applied in the laboratory, we did not pursue the maximum likelihood estimation of session factors for more complex experimental designs. However, the current approach allows session factors to be calculated as if the design were one-way and then applied to the data. The resulting factor-corrected data can then be used in a statistical package for further analysis.

When factor correction is used, sessions no longer have to be discarded because of loss of some data points in the laboratory procedure. Moreover, factor correction enables the correction of multi-session data sets that are necessarily incomplete because more conditions have to be tested than can be measured per session. Furthermore, because the control condition is no longer required in each session, resources can be used more efficiently. The smaller within-condition error after application of factor correction, as compared to normalisation and standardisation, increases the power of the statistical tests of biological hypotheses and reduces the required number of observations.

## Conclusion

We present factor correction as an effective and efficient method to eliminate between-session variation in multi-session experiments. The method was implemented in an easy-to-use computer program that is available on request from biolab-services@amc.uva.nl (subject: factor). Factor correction helps experimental biologists to find the needle of biologically relevant information in the haystack of between-session variation.

## Methods

### Molecular-biology data set

The aim of the study from which this data set is derived was to examine the transcriptional activity of different combinations of enhancer, promoter and first intron elements of the rat Glutamine Synthetase (GS) gene [1]. To this end, DNA constructs containing different enhancer-promoter-intron sequences in front of the luciferase reporter gene were transfected into rat FTO-2B hepatoma cells by electroporation. Cells were co-transfected with a chloramphenicol acetyltransferase expression plasmid (pRSVcat). Sixteen hours after transfection the medium was refreshed and another 48 hours later the cells were harvested and tested for luciferase and CAT activity. The activity of the tested DNA construct was expressed as the ratio between the luciferase activity and the CAT activity.

### HIV-1-virology data set

HIV-1 constructs with a modified mechanism of transcription regulation [13] and variation in the viral Tat gene (to be described elsewhere) were transfected into human C33A cervix carcinoma cells as previously described [14]. Virus production was measured by CA-p24 ELISA on culture supernatant samples two days after transfection. The experiment was repeated seven times.

## Declarations

### Acknowledgements

The authors wish to thank Prof. Dr. Koos A.H. Zwinderman, Prof. Dr. Antoon F.M. Moorman, Dr. Fred W. van Leeuwen and Dr. Antoine H.C. van Kampen for their helpful discussions and critical comments during the preparation of this manuscript. We are indebted to the Bioinformatics Laboratory, Amsterdam, for managing the e-mail requests to biolab-services. Nicolai V. Sokhirev is acknowledged for making the PasMatLib http://www.shokhirev.com/nikolai/programs/tools/PasMatLib/PasMatLib.html available on the Internet.

## References

1. Garcia de Vaes Lovillo RM, Ruijter JM, Labruyere WT, Hakvoort TBM, Lamers WH: Upstream and intronic regulatory sequences interact in the activation of the glutamine synthetase promoter. Eur J Biochem. 2003, 270: 206-212. 10.1046/j.1432-1033.2003.03424.x.
2. Hollon T, Yoshimura FK: Variation in enzymatic transient gene expression assays. Analytical Biochem. 1989, 182: 411-418. 10.1016/0003-2697(89)90616-7.
3. Richardson BA, Overbaugh J: Minireview. Basic statistical considerations in virological experiments. J Virol. 2005, 79: 669-676. 10.1128/JVI.79.2.669-676.2005.
4. Anonymous: Statistically significant. Editorial. Nat Med. 2005, 11: 1. 10.1038/nm0105-1.
5. Knox WE: Enzyme patterns in fetal, adult and neoplastic rat tissues. 1976, Basel, New York: S Karger, 64-67, 115-119.
6. Sokal RR, Rohlf FJ: Biometry. The principle and practice of statistics in biological research. 1969, San Francisco: WH Freeman.
7. Conover WJ: Practical nonparametric statistics. 1980, New York: John Wiley.
8. Johnson NL, Kotz S, Balakrishnan N: Continuous univariate distributions. 1994, New York: John Wiley, 1: 298-331.
9. Meiser V: Computational science education project. 2.4.3 Cauchy distribution. [http://csep1.phy.ornl.gov/CSEP/MC/NODE20.html]
10. Batschelet E: Introduction to mathematics for life scientists. 1975, Berlin: Springer Verlag, 14-15.
11. Snedecor GW, Cochran WG: Statistical methods. 1982, Ames: Iowa State University Press, 274-276.
12. Kerr MK, Churchill GA: Statistical design and the analysis of gene expression microarray data. Genet Res. 2001, 77: 123-128. 10.1017/S0016672301005055.
13. Verhoef K, Marzio G, Hillen W, Bujard H, Berkhout B: Strict control of human immunodeficiency virus type 1 replication by a genetic switch: Tet for Tat. J Virol. 2001, 75: 979-987. 10.1128/JVI.75.2.979-987.2001.
14. Das AT, Zhou X, Vink M, Klaver B, Verhoef K, Marzio G, Berkhout B: Viral evolution as a tool to improve the tetracycline-regulated gene expression system. J Biol Chem. 2004, 279: 18776-18782. 10.1074/jbc.M313895200.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.