Home

Articles
Tutorials

Single-Pass Online Statistics Algorithms

Nikolai Shokhirev

March 13, 2013

1. Introduction

The "textbook" two-pass algorithm for the centered moments (variance, skewness, kurtosis, covariance) is obviously inefficient. There were suggested several alternative algorithms [1 - 16]. Unfortunately I could not find suitable one-pass windowed algorithms similar to the mowing average [17]. Therefore I decided to derive the equations and implement them myself.

2. Problem statement

We assume that there is a stream of data: $x_{1},\, x_{2},\, x_{3},\,\ldots\,,\, x_{k}$ . Here $x_{k}$ is the most recent value. There are two variants of statistics: (i) a cumulative statistics of all $k$ values, and (ii) the statistics of $n$ the most recent values. The latter is also called a moving or a windowed statistics. Note that the cumulative statistics is also a windowed with $n=k$.

3. The first moment, mean value

3.1. Definition

The first moment (moving average, mean value) of the length $n$ is defined as \begin{equation} m_{k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}x_{i}=\frac{1}{n}S_{k}(n)\label{eq:avg} \end{equation} Below for $k=n$ we use the notation $X_{k}(k)=X_{k}$.

3.2. Cumulative calculations

When all incoming values are incorporated into calculations ($n=k$), the above equation (\ref{eq:avg}) reduces to: \begin{equation} m_{k}=m_{k}(k)=\frac{1}{k}\sum_{i=1}^{k}x_{i}=\frac{1}{k}S_{k}\label{eq:avgc} \end{equation} From Eq. \ref{eq:avgc} we get the following recurrence: \begin{eqnarray} m_{k} & = & \frac{1}{k}S_{k}=\frac{1}{k}\left(x_{k}+S_{k-1}\right)=\frac{x_{k}}{k}+\frac{k-1}{k}m_{k-1}=m_{k-1}+\frac{x_{k}-m_{k-1}}{k}\nonumber \\ & = & m_{k-1}+\frac{\delta_{k}}{k}\label{eq:recc} \end{eqnarray} where \begin{equation} \delta_{k}=x_{k}-m_{k-1}\label{eq:dk} \end{equation}

3.3. Windowed (constant-length, moving) calculations

In the case when only the last $n$ values are used, the recurrence is \begin{eqnarray} m_{k}(n) & = & \frac{1}{n}S_{k}(n)=\frac{1}{n}\left(x_{k}+S_{k-1}(n)-x_{k-n}\right)=\frac{x_{k}}{k}+\frac{k-1}{k}m_{k-1}=m_{k-1}+\frac{x_{k}-x_{k-n}}{n}\nonumber \\ & = & m_{k-1}(n)+\frac{\delta_{k}-\delta_{k-n}}{n}\label{eq:recw} \end{eqnarray} Here \begin{equation} \delta_{k-n}=x_{k-n}-m_{k-1}\label{eq:dkn} \end{equation} Note that (\ref{eq:recw}) reduces to (\ref{eq:recc}) if we set \begin{eqnarray} \delta_{k-n} & = & 0\nonumber \\ n & = & k\label{eq:w2c} \end{eqnarray} Instead of (\ref{eq:w2c}) we can also formally define $\delta_{0}=0$.

4. Central moments

4.1. Definitions

Similar to (\ref{eq:avg}), the p-th order central moments are defined as \begin{equation} M_{p,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{p}=\frac{1}{n}S_{p,k}(n)\label{eq:avgp} \end{equation}

4.2. Windowed variance ($p=2$)

For the case of $p=2$ Eq. \ref{eq:avgp} reduces to \begin{equation} M_{2,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{2}=\frac{1}{n}S_{p,k}(n)\label{eq:avg2} \end{equation} It can be rewritten using (\ref{eq:recw}): \begin{eqnarray} S_{2,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{2}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{2}\nonumber \\ & = & \sum_{i=k-n+1}^{k}\left\{ \left[x_{i}-m_{k-1}(n)\right]^{2}-2\left[x_{i}-m_{k-1}(n)\right]\frac{\delta_{k}-\delta_{k-n}}{n}+\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}\right\} \label{eq:S2w} \end{eqnarray} or \begin{eqnarray} S_{2,k}(n) & = & S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}-2(\delta_{k}-\delta_{k-n})[m_{k}(n)-m_{k-1}(n)]+\frac{(\delta_{k}-\delta_{k-n})^{2}}{n}\nonumber \\ & = & S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}-\frac{(\delta_{k}-\delta_{k-n})^{2}}{n}\label{eq:S2wrec} \end{eqnarray} This gives the recurrence formula for the windowed online variance calculations.

4.3. Cumulative variance calculations

In order to get the cumulative formula, we can formally apply the conditions (\ref{eq:w2c}): \begin{equation} S_{2,k}=S_{2,k-1}+\delta_{k}^{2}-\frac{\delta_{k}^{2}}{k}=S_{2,k-1}+\frac{k-1}{k}\delta_{k}^{2}\label{eq:rec2c} \end{equation} This can be also rewritten as \[ S_{2,k}=S_{2,k-1}+(x_{k}-m_{k})(x_{k}-m_{k-1}) \]

5. Higher central sums

5.1. Windowed calculations for $p=3$

Similar to (\ref{eq:S2w}): \begin{eqnarray} S_{3,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{3}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{3}\nonumber \\ & = & \sum_{i=k-n+1}^{k}\left\{ \begin{array}{c} \left[x_{i}-m_{k-1}(n)\right]^{3}-3\left[x_{i}-m_{k-1}(n)\right]^{2}\frac{\delta_{k}-\delta_{k-n}}{n}\\ +3\left[x_{i}-m_{k-1}(n)\right]\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}-\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{3} \end{array}\right\} \label{eq:S3w} \end{eqnarray} or \begin{equation} S_{3,k}(n)=S_{3,k-1}(n)+\delta_{k}^{3}-\delta_{k-n}^{3}-\frac{3}{n}(\delta_{k}-\delta_{k-n})[S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}]+\frac{2}{n^{2}}(\delta_{k}-\delta_{k-n})^{3}\label{eq:S3wrec} \end{equation}

5.2. Cumulative calculations for $p=3$

Using the same trick and applying the condition (\ref{eq:w2c}), we get: \begin{equation} S_{3,k}=S_{3,k-1}+\delta_{k}^{3}-\frac{3}{k}\delta_{k}[S_{2,k-1}+\delta_{k}^{2}]+\frac{2}{k^{2}}\delta_{k}^{3} \end{equation} or \begin{equation} S_{3,k}=S_{3,k-1}-\frac{3}{k}S_{2,k-1}\delta_{k}+\frac{(k-1)(k-2)}{k^{2}}\delta_{k}^{3} \end{equation}

5.3. Windowed calculations for $p=4$

Similar to (\ref{eq:S2w}) and (\ref{eq:S3w}): \begin{eqnarray} S_{4,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{4}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{4}\nonumber \\ & = & \sum_{i=k-n+1}^{k}\left\{ \begin{array}{c} \left[x_{i}-m_{k-1}(n)\right]^{3}-4\left[x_{i}-m_{k-1}(n)\right]^{3}\frac{\delta_{k}-\delta_{k-n}}{n}+\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{4}\\ +6\left[x_{i}-m_{k-1}(n)\right]^{2}\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}-4\left[x_{i}-m_{k-1}(n)\right]\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{3} \end{array}\right\} \end{eqnarray} or \begin{eqnarray} S_{4,k}(n) & = & S_{4,k-1}(n)+\delta_{k}^{4}-\delta_{k-n}^{4}-\frac{4}{n}(\delta_{k}-\delta_{k-n})[S_{3,k-1}(n)+\delta_{k}^{3}-\delta_{k-n}^{3}]\nonumber \\ & + & \frac{6}{n^{2}}(\delta_{k}-\delta_{k-n})^{2}[S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}]-\frac{3}{n^{3}}(\delta_{k}-\delta_{k-n})^{4} \end{eqnarray}

5.4. Cumulative calculations for $p=4$

Using the same trick and applying the condition (\ref{eq:w2c}), we get: \begin{equation} S_{4,k}=S_{4,k-1}+\delta_{k}^{4}-\frac{4}{k}\delta_{k}[S_{3,k-1}+\delta_{k}^{3}]+\frac{6}{k^{2}}\delta_{k}^{2}[S_{2,k-1}+\delta_{k}^{2}]-\frac{3}{k^{3}}\delta_{k}^{4} \end{equation} or \begin{equation} S_{4,k}=S_{4,k-1}-\frac{4}{k}S_{3,k-1}\delta_{k}+\frac{6}{k^{2}}S_{2,k-1}\delta_{k}^{2}+\frac{(k-1)(k^{2}-3k+3)}{k^{2}}\delta_{k}^{4} \end{equation}

6. Covariance

The above approach can be generalized for the calculation of the covariance matrix elements.

6.1. Definition

The covariance of the components $x_{\mu}$ and $x_{\nu}$ is defined as \begin{equation} c_{\mu\nu,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{\mu,i}-m_{\mu,k}(n)\right]\left[x_{\nu,i}-m_{\nu,k}(n)\right] = \frac{1}{n}C_{\mu\nu,k}(n)\label{eq:covw} \end{equation} Here $m_{\mu,k}(n)$ is the mean value for the $\mu$-th component. It obeys the generalized equation (\ref{eq:recc}): \begin{equation} m_{\mu,k}(n)=m_{\mu,k-1}(n)+\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\label{eq:mmunuk} \end{equation} with Eqs. (\ref{eq:dk}) and (\ref{eq:dkn}) modified accordingly.

6.2. Windowed calculations

Using the recurrence (\ref{eq:mmunuk}) for each component we have \begin{equation} C_{\mu\nu,k}(n)=\sum_{i=k-n+1}^{k}\left[x_{\mu,i}-m_{\mu,k-1}(n)-\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\right]\left[x_{\nu,i}-m_{\nu,k-1}(n)-\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n}\right] \end{equation} The above equation can be rewritten as \begin{equation} C_{\mu\nu,k}(n)=\sum_{i=k-n+1}^{k}\left\{ \begin{array}{c} \left[x_{\mu,i}-m_{\mu,k-1}(n)\right]\left[x_{\nu,i}-m_{\nu,k-1}(n)\right]-\left[x_{\mu,i}-m_{\mu,k-1}(n)\right]\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n}\\ \frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\left[x_{\nu,i}-m_{\nu,k-1}(n)\right]+\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\,\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n} \end{array}\right\} \end{equation} or \begin{equation} C_{\mu\nu,k}(n)=C_{\mu\nu,k-1}(n)+\delta_{\mu,k}\delta_{\nu,k}-\delta_{\mu,k-n}\delta_{\nu,k-n}-\frac{(\delta_{\mu,k}-\delta_{\mu,k-n})(\delta_{\nu,k}-\delta_{\nu,k-n})}{n}\label{eq:Cmunuk} \end{equation} This gives the recurrence formula for the windowed online covariance calculations.

6.3. Cumulative calculations

Applying the conditions (\ref{eq:w2c}) to each component, we get: \begin{eqnarray} C_{\mu\nu,k} & = & C_{\mu\nu,k-1}+\delta_{\mu,k}\delta_{\nu,k}-\frac{\delta_{\mu,k}\delta_{\nu,k}}{k}\nonumber \\ & = & C_{\mu\nu,k-1}+\delta_{\mu,k}\delta_{\nu,k}-\frac{k-1}{k}\delta_{\mu,k}\delta_{\nu,k} \end{eqnarray} This gives the recurrence formula for the cumulative online covariance calculations. Note that for $\mu=\nu$ the above equations reduce to the variance formulas.

7. Implementation

My Python implementation can be found on GitHub: https://github.com/dr-nikolai/online_stat. It also includes the Jupyter notebook with a test/demo.

Note that this code is not fully optimized. It is primarily an illustration to this article.

Acknowledgement

I am grateful to Marko Draisma who pointed to a typo in Eq. 18.

Suggested citation for this article:

Nikolai Shokhirev, 2013. Single-Pass Online Statistical Algorithms, http://www.numericalexpert.com/articles/single_pass_stat

References

  1. Chan, T. F.; Golub, G. H. and LeVeque, R. J., Updating formulae and a pairwise algorithm for computing sample variances, COMPSTAT 1982 5th Symposium held at Toulouse 1982, 1982, 30-412
  2. Chan, T. F.; Golub, G. H. and LeVeque, R. J., Algorithms for computing the sample variance: Analysis and recommendations The American Statistician, Taylor & Francis Group, 1983, 37, 242-247
  3. Datar, M.; Gionis, A.; Indyk, P. and Motwani, R., Maintaining stream statistics over sliding windows, SIAM Journal on Computing, SIAM, 2002, 31, 1794-1813
  4. Zivot, E. and Wang, J., Rolling Analysis of Time Series, Modeling Financial Time Series with S-Plus, Springer, 2003, 299-346
  5. Pébay, P., Formulas for robust, one-pass parallel computation of covariances and arbitrary-order statistical moments. Sandia National Laboratories, 2008
  6. Alexander, C., Moving Average Models for Volatility and Correlation, and Covariance Matrices, in Handbook of Finance, John Wiley & Sons, Inc., 2008, 2-14
  7. Finch, T., Incremental calculation of weighted mean and variance, University of Cambridge, 2009, 4, 11-5
  8. Bennett, J.; Grout, R.; Pébay, P.; Roe, D. and Thompson, D., Numerically stable, single-pass, parallel statistics algorithms, Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on, 2009, 1-8
  9. Choi, M. and Sweetman, B., Efficient calculation of statistical moments for structural health monitoring. Structural Health Monitoring, SAGE Publications, 2010, 9, 13-24
  10. McCrary, S., Implementing Algorithms to Measure Common Statistics, Available at SSRN 2695198, 2015
  11. Wikipedia, Algorithms for calculating variance, 2016.
  12. Cook, J. D., Accurately computing running variance, 2008.
  13. Cook, J. D., Computing skewness and kurtosis in one pass, 2008.
  14. TurnerSR, Single-pass, parallel statistics algorithms for mean, variance, and standard deviation, 2014.
  15. Jenks, G., RunStats: Computing Statistics and Regression in One Pass, 2015.
  16. Babcock, B.; Datar, M.; Motwani, R. and O'Callaghan, L. Maintaining variance and k-medians over data stream windows Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2003, 234-243
  17. Wikipedia, Moving average, 2016.

© Nikolai Shokhirev, 2012-2024

email: nikolai(dot)shokhirev(at)gmail(dot)com

Count: