Single-Pass Online Statistics Algorithms
Nikolai Shokhirev
March 13, 2013
1. Introduction
The "textbook" two-pass algorithm for the centered moments (variance,
skewness, kurtosis, covariance) is obviously inefficient. There were
suggested several alternative algorithms [1 - 16].
Unfortunately I could not find suitable one-pass windowed algorithms
similar to the mowing average [17]. Therefore I decided
to derive the equations and implement them myself.
2. Problem statement
We assume that there is a stream of data: $x_{1},\, x_{2},\, x_{3},\,\ldots\,,\, x_{k}$
. Here $x_{k}$ is the most recent value. There are two variants of
statistics: (i) a cumulative statistics of all $k$ values, and (ii)
the statistics of $n$ the most recent values. The latter is also
called a moving or a windowed statistics. Note that the cumulative
statistics is also a windowed with $n=k$.
3. The first moment, mean value
3.1. Definition
The first moment (moving average, mean value) of the length $n$ is
defined as
\begin{equation}
m_{k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}x_{i}=\frac{1}{n}S_{k}(n)\label{eq:avg}
\end{equation}
Below for $k=n$ we use the notation $X_{k}(k)=X_{k}$.
3.2. Cumulative calculations
When all incoming values are incorporated into calculations ($n=k$),
the above equation (\ref{eq:avg}) reduces to:
\begin{equation}
m_{k}=m_{k}(k)=\frac{1}{k}\sum_{i=1}^{k}x_{i}=\frac{1}{k}S_{k}\label{eq:avgc}
\end{equation}
From Eq. \ref{eq:avgc} we get the following recurrence:
\begin{eqnarray}
m_{k} & = & \frac{1}{k}S_{k}=\frac{1}{k}\left(x_{k}+S_{k-1}\right)=\frac{x_{k}}{k}+\frac{k-1}{k}m_{k-1}=m_{k-1}+\frac{x_{k}-m_{k-1}}{k}\nonumber \\
& = & m_{k-1}+\frac{\delta_{k}}{k}\label{eq:recc}
\end{eqnarray}
where
\begin{equation}
\delta_{k}=x_{k}-m_{k-1}\label{eq:dk}
\end{equation}
3.3. Windowed (constant-length, moving) calculations
In the case when only the last $n$ values are used, the recurrence
is
\begin{eqnarray}
m_{k}(n) & = & \frac{1}{n}S_{k}(n)=\frac{1}{n}\left(x_{k}+S_{k-1}(n)-x_{k-n}\right)=\frac{x_{k}}{k}+\frac{k-1}{k}m_{k-1}=m_{k-1}+\frac{x_{k}-x_{k-n}}{n}\nonumber \\
& = & m_{k-1}(n)+\frac{\delta_{k}-\delta_{k-n}}{n}\label{eq:recw}
\end{eqnarray}
Here
\begin{equation}
\delta_{k-n}=x_{k-n}-m_{k-1}\label{eq:dkn}
\end{equation}
Note that (\ref{eq:recw}) reduces to (\ref{eq:recc}) if we set
\begin{eqnarray}
\delta_{k-n} & = & 0\nonumber \\
n & = & k\label{eq:w2c}
\end{eqnarray}
Instead of (\ref{eq:w2c}) we can also formally define $\delta_{0}=0$.
4. Central moments
4.1. Definitions
Similar to (\ref{eq:avg}), the p-th order central moments are defined
as
\begin{equation}
M_{p,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{p}=\frac{1}{n}S_{p,k}(n)\label{eq:avgp}
\end{equation}
4.2. Windowed variance ($p=2$)
For the case of $p=2$ Eq. \ref{eq:avgp} reduces to
\begin{equation}
M_{2,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{2}=\frac{1}{n}S_{p,k}(n)\label{eq:avg2}
\end{equation}
It can be rewritten using (\ref{eq:recw}):
\begin{eqnarray}
S_{2,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{2}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{2}\nonumber \\
& = & \sum_{i=k-n+1}^{k}\left\{ \left[x_{i}-m_{k-1}(n)\right]^{2}-2\left[x_{i}-m_{k-1}(n)\right]\frac{\delta_{k}-\delta_{k-n}}{n}+\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}\right\} \label{eq:S2w}
\end{eqnarray}
or
\begin{eqnarray}
S_{2,k}(n) & = & S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}-2(\delta_{k}-\delta_{k-n})[m_{k}(n)-m_{k-1}(n)]+\frac{(\delta_{k}-\delta_{k-n})^{2}}{n}\nonumber \\
& = & S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}-\frac{(\delta_{k}-\delta_{k-n})^{2}}{n}\label{eq:S2wrec}
\end{eqnarray}
This gives the recurrence formula for the windowed online variance
calculations.
4.3. Cumulative variance calculations
In order to get the cumulative formula, we can formally apply the
conditions (\ref{eq:w2c}):
\begin{equation}
S_{2,k}=S_{2,k-1}+\delta_{k}^{2}-\frac{\delta_{k}^{2}}{k}=S_{2,k-1}+\frac{k-1}{k}\delta_{k}^{2}\label{eq:rec2c}
\end{equation}
This can be also rewritten as
\[
S_{2,k}=S_{2,k-1}+(x_{k}-m_{k})(x_{k}-m_{k-1})
\]
5. Higher central sums
5.1. Windowed calculations for $p=3$
Similar to (\ref{eq:S2w}):
\begin{eqnarray}
S_{3,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{3}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{3}\nonumber \\
& = & \sum_{i=k-n+1}^{k}\left\{ \begin{array}{c}
\left[x_{i}-m_{k-1}(n)\right]^{3}-3\left[x_{i}-m_{k-1}(n)\right]^{2}\frac{\delta_{k}-\delta_{k-n}}{n}\\
+3\left[x_{i}-m_{k-1}(n)\right]\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}-\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{3}
\end{array}\right\} \label{eq:S3w}
\end{eqnarray}
or
\begin{equation}
S_{3,k}(n)=S_{3,k-1}(n)+\delta_{k}^{3}-\delta_{k-n}^{3}-\frac{3}{n}(\delta_{k}-\delta_{k-n})[S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}]+\frac{2}{n^{2}}(\delta_{k}-\delta_{k-n})^{3}\label{eq:S3wrec}
\end{equation}
5.2. Cumulative calculations for $p=3$
Using the same trick and applying the condition (\ref{eq:w2c}), we
get:
\begin{equation}
S_{3,k}=S_{3,k-1}+\delta_{k}^{3}-\frac{3}{k}\delta_{k}[S_{2,k-1}+\delta_{k}^{2}]+\frac{2}{k^{2}}\delta_{k}^{3}
\end{equation}
or
\begin{equation}
S_{3,k}=S_{3,k-1}-\frac{3}{k}S_{2,k-1}\delta_{k}+\frac{(k-1)(k-2)}{k^{2}}\delta_{k}^{3}
\end{equation}
5.3. Windowed calculations for $p=4$
Similar to (\ref{eq:S2w}) and (\ref{eq:S3w}):
\begin{eqnarray}
S_{4,k}(n) & = & \sum_{i=k-n+1}^{k}\left[x_{i}-m_{k}(n)\right]^{4}=\sum_{i=k-n+1}^{k}\left[x_{i}-m_{k-1}(n)-\frac{\delta_{k}-\delta_{k-n}}{n}\right]^{4}\nonumber \\
& = & \sum_{i=k-n+1}^{k}\left\{ \begin{array}{c}
\left[x_{i}-m_{k-1}(n)\right]^{3}-4\left[x_{i}-m_{k-1}(n)\right]^{3}\frac{\delta_{k}-\delta_{k-n}}{n}+\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{4}\\
+6\left[x_{i}-m_{k-1}(n)\right]^{2}\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{2}-4\left[x_{i}-m_{k-1}(n)\right]\left(\frac{\delta_{k}-\delta_{k-n}}{n}\right)^{3}
\end{array}\right\}
\end{eqnarray}
or
\begin{eqnarray}
S_{4,k}(n) & = & S_{4,k-1}(n)+\delta_{k}^{4}-\delta_{k-n}^{4}-\frac{4}{n}(\delta_{k}-\delta_{k-n})[S_{3,k-1}(n)+\delta_{k}^{3}-\delta_{k-n}^{3}]\nonumber \\
& + & \frac{6}{n^{2}}(\delta_{k}-\delta_{k-n})^{2}[S_{2,k-1}(n)+\delta_{k}^{2}-\delta_{k-n}^{2}]-\frac{3}{n^{3}}(\delta_{k}-\delta_{k-n})^{4}
\end{eqnarray}
5.4. Cumulative calculations for $p=4$
Using the same trick and applying the condition (\ref{eq:w2c}), we
get:
\begin{equation}
S_{4,k}=S_{4,k-1}+\delta_{k}^{4}-\frac{4}{k}\delta_{k}[S_{3,k-1}+\delta_{k}^{3}]+\frac{6}{k^{2}}\delta_{k}^{2}[S_{2,k-1}+\delta_{k}^{2}]-\frac{3}{k^{3}}\delta_{k}^{4}
\end{equation}
or
\begin{equation}
S_{4,k}=S_{4,k-1}-\frac{4}{k}S_{3,k-1}\delta_{k}+\frac{6}{k^{2}}S_{2,k-1}\delta_{k}^{2}+\frac{(k-1)(k^{2}-3k+3)}{k^{2}}\delta_{k}^{4}
\end{equation}
6. Covariance
The above approach can be generalized for the calculation of the covariance
matrix elements.
6.1. Definition
The covariance of the components $x_{\mu}$ and $x_{\nu}$ is defined
as
\begin{equation}
c_{\mu\nu,k}(n)=\frac{1}{n}\sum_{i=k-n+1}^{k}\left[x_{\mu,i}-m_{\mu,k}(n)\right]\left[x_{\nu,i}-m_{\nu,k}(n)\right] = \frac{1}{n}C_{\mu\nu,k}(n)\label{eq:covw}
\end{equation}
Here $m_{\mu,k}(n)$ is the mean value for the $\mu$-th component.
It obeys the generalized equation (\ref{eq:recc}):
\begin{equation}
m_{\mu,k}(n)=m_{\mu,k-1}(n)+\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\label{eq:mmunuk}
\end{equation}
with Eqs. (\ref{eq:dk}) and (\ref{eq:dkn}) modified accordingly.
6.2. Windowed calculations
Using the recurrence (\ref{eq:mmunuk}) for each component we have
\begin{equation}
C_{\mu\nu,k}(n)=\sum_{i=k-n+1}^{k}\left[x_{\mu,i}-m_{\mu,k-1}(n)-\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\right]\left[x_{\nu,i}-m_{\nu,k-1}(n)-\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n}\right]
\end{equation}
The above equation can be rewritten as
\begin{equation}
C_{\mu\nu,k}(n)=\sum_{i=k-n+1}^{k}\left\{ \begin{array}{c}
\left[x_{\mu,i}-m_{\mu,k-1}(n)\right]\left[x_{\nu,i}-m_{\nu,k-1}(n)\right]-\left[x_{\mu,i}-m_{\mu,k-1}(n)\right]\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n}\\
\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\left[x_{\nu,i}-m_{\nu,k-1}(n)\right]+\frac{\delta_{\mu,k}-\delta_{\mu,k-n}}{n}\,\frac{\delta_{\nu,k}-\delta_{\nu,k-n}}{n}
\end{array}\right\}
\end{equation}
or
\begin{equation}
C_{\mu\nu,k}(n)=C_{\mu\nu,k-1}(n)+\delta_{\mu,k}\delta_{\nu,k}-\delta_{\mu,k-n}\delta_{\nu,k-n}-\frac{(\delta_{\mu,k}-\delta_{\mu,k-n})(\delta_{\nu,k}-\delta_{\nu,k-n})}{n}\label{eq:Cmunuk}
\end{equation}
This gives the recurrence formula for the windowed online covariance
calculations.
6.3. Cumulative calculations
Applying the conditions (\ref{eq:w2c}) to each component, we get:
\begin{eqnarray}
C_{\mu\nu,k} & = & C_{\mu\nu,k-1}+\delta_{\mu,k}\delta_{\nu,k}-\frac{\delta_{\mu,k}\delta_{\nu,k}}{k}\nonumber \\
& = & C_{\mu\nu,k-1}+\delta_{\mu,k}\delta_{\nu,k}-\frac{k-1}{k}\delta_{\mu,k}\delta_{\nu,k}
\end{eqnarray}
This gives the recurrence formula for the cumulative online covariance
calculations. Note that for $\mu=\nu$ the above equations reduce
to the variance formulas.
7. Implementation
My Python implementation can be found on GitHub: https://github.com/dr-nikolai/online_stat.
It also includes the Jupyter notebook with a test/demo.
Note that this code is not fully optimized. It is primarily an illustration to this article.
Acknowledgement
I am grateful to Marko Draisma who pointed to a typo in Eq. 18.
Suggested citation for this article:
Nikolai Shokhirev, 2013. Single-Pass Online Statistical Algorithms,
http://www.numericalexpert.com/articles/single_pass_stat
References
-
Chan, T. F.; Golub, G. H. and LeVeque, R. J., Updating
formulae and a pairwise algorithm for computing sample variances,
COMPSTAT 1982 5th Symposium held at Toulouse 1982, 1982, 30-412
-
Chan, T. F.; Golub, G. H. and LeVeque, R. J., Algorithms
for computing the sample variance: Analysis and recommendations The
American Statistician, Taylor & Francis Group, 1983, 37, 242-247
-
Datar, M.; Gionis, A.; Indyk, P. and Motwani, R., Maintaining
stream statistics over sliding windows, SIAM Journal on Computing,
SIAM, 2002, 31, 1794-1813
-
Zivot, E. and Wang, J., Rolling Analysis of Time Series,
Modeling Financial Time Series with S-Plus, Springer, 2003, 299-346
-
Pébay, P.,
Formulas for robust, one-pass parallel
computation of covariances and arbitrary-order statistical moments.
Sandia National Laboratories, 2008
-
Alexander, C., Moving Average Models for Volatility
and Correlation, and Covariance Matrices, in Handbook of Finance,
John Wiley & Sons, Inc., 2008, 2-14
-
Finch, T., Incremental calculation of weighted mean
and variance, University of Cambridge, 2009, 4, 11-5
-
Bennett, J.; Grout, R.; Pébay, P.; Roe, D. and Thompson, D.,
Numerically stable, single-pass, parallel statistics algorithms,
Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International
Conference on, 2009, 1-8
-
Choi, M. and Sweetman, B., Efficient calculation of
statistical moments for structural health monitoring. Structural Health
Monitoring, SAGE Publications, 2010, 9, 13-24
-
McCrary, S., Implementing Algorithms to Measure Common
Statistics, Available at SSRN 2695198, 2015
-
Wikipedia,
Algorithms for calculating variance, 2016.
-
Cook, J. D.,
Accurately computing running variance, 2008.
-
Cook, J. D.,
Computing skewness and kurtosis in one pass, 2008.
-
TurnerSR,
Single-pass, parallel statistics algorithms
for mean, variance, and standard deviation, 2014.
-
Jenks, G.,
RunStats: Computing Statistics and Regression in One Pass, 2015.
-
Babcock, B.; Datar, M.; Motwani, R. and O'Callaghan, L.
Maintaining variance and k-medians over data stream windows Proceedings
of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles
of database systems, 2003, 234-243
-
Wikipedia, Moving average, 2016.