The correlation between two vectors of data is cor(X,Y) = cov(X,Y)/[sd(X) * sd(Y)]. Is there any way to break these up into block computations? The essential computation required (since sd(X) = sqrt(cov(X,X))) is
cov(X,Y) = <X Y> - <X> <Y>
= 1/N (sum[i] X[i] Y[i]) - 1/N (sum[i] X[i]) * 1/N (sum[i] Y[i])
This is a sum over all indices i. Each index i, however, corresponds to a node n with N_n events and a sub-index (in that node) k_n:
cov(X,Y) = 1/N (sum[n] sum[k_n] X[k_n] Y[k_n])
- 1/N^2 (sum[n] sum[k_n] X[k_n]) * (sum[n] sum[k_n] Y[k_n])
Since N = sum[n] N_n, this can be rewritten as
cov(X,Y) = (sum[n] N_n/N 1/N_n sum[k_n] X[k_n] Y[k_n])
- (sum[n] N_n/N 1/N_n sum[k_n] X[k_n]) * (sum[n] N_n/N 1/N_n sum[k_n] Y[k_n])
= (sum[n] N_n/N <XY>_n) - (sum[n] N_n/N <X>_n) * (sum[n] N_n/N <Y>_n)
So each node need only report its number of entries N_n and the means <X>_n, <Y>_n, and <XY>_n (and, for the purposes of the correlation, <X^2>_n and <Y^2>_n) within the node. The global means are then obtained by summing these per-node means with the appropriate weights N_n/N (where again N = sum[n] N_n), and the global covariance follows from the global means.
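As a rough illustration of the weighted combination described here, the following Python sketch has each node reduce its block to a small summary, which a coordinator then merges. The names `NodeSummary`, `summarize`, and `global_cov` are made up for this example, and the two "nodes" are just hand-picked lists; it follows the 1/N (population) definition of covariance used above.

```python
from dataclasses import dataclass


@dataclass
class NodeSummary:
    """What one node reports (hypothetical container for this sketch)."""
    n: int          # N_n, number of entries on the node
    mean_x: float   # <X>_n
    mean_y: float   # <Y>_n
    mean_xy: float  # <XY>_n


def summarize(xs, ys):
    """Compute the local summary for one node's non-empty block of data."""
    n = len(xs)
    return NodeSummary(
        n=n,
        mean_x=sum(xs) / n,
        mean_y=sum(ys) / n,
        mean_xy=sum(x * y for x, y in zip(xs, ys)) / n,
    )


def global_cov(summaries):
    """Merge per-node means with weights N_n/N, then form <XY> - <X><Y>."""
    total = sum(s.n for s in summaries)  # N = sum[n] N_n
    mean_x = sum(s.n / total * s.mean_x for s in summaries)
    mean_y = sum(s.n / total * s.mean_y for s in summaries)
    mean_xy = sum(s.n / total * s.mean_xy for s in summaries)
    return mean_xy - mean_x * mean_y


# Two illustrative nodes holding parts of X = [1, 2, 3, 4, 5], Y = [2, 4, 7, 8, 9]
nodes = [summarize([1, 2, 3], [2, 4, 7]), summarize([4, 5], [8, 9])]
print(global_cov(nodes))
```

Only four numbers per node cross the network here, regardless of how many entries each node holds.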
Edit: LaTeX version
Since these equations are hard to parse in plain text, here they are typeset in LaTeX. The covariance of two lists of data X and Y is defined to be

$$\operatorname{cov}(X, Y) = \langle XY \rangle - \langle X \rangle \langle Y \rangle$$

where each quantity <X>, <Y>, and <XY> is a mean (of the list X, the list Y, and the pairwise product list XY). The computation of the means can be broken down as a weighted sum over the various nodes. Calling any of X, Y, XY, or X^2 or Y^2 (necessary to compute the correlation) Z, the mean of Z is:

$$\langle Z \rangle = \sum_k \frac{N_k}{N} \langle Z \rangle_k, \qquad N = \sum_k N_k$$

where <Z>_k is the mean of Z on the k-th node and N_k is the number of data points in the k-th node. This reduces the amount of information needed from each node to N_k, <X>_k, <Y>_k, <XY>_k, <X^2>_k, and <Y^2>_k.
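To make the full correlation concrete, here is a minimal Python sketch that applies the same weighted mean to each of the five per-node quantities and then forms cor(X,Y). The function names and the dictionary keys (`n`, `x`, `y`, `xy`, `xx`, `yy`) are assumptions chosen for this example, not part of any particular framework.

```python
import math


def local_report(xs, ys):
    """Per-node report: N_k and the local means of X, Y, XY, X^2, Y^2."""
    n = len(xs)
    return {
        "n": n,
        "x": sum(xs) / n,
        "y": sum(ys) / n,
        "xy": sum(a * b for a, b in zip(xs, ys)) / n,
        "xx": sum(a * a for a in xs) / n,
        "yy": sum(b * b for b in ys) / n,
    }


def weighted_mean(counts, means):
    """<Z> = sum_k (N_k / N) <Z>_k, with N = sum_k N_k."""
    total = sum(counts)
    return sum(n_k / total * z_k for n_k, z_k in zip(counts, means))


def global_correlation(reports):
    """Combine per-node reports into cor(X,Y) = cov(X,Y) / [sd(X) * sd(Y)]."""
    counts = [r["n"] for r in reports]
    m = {
        key: weighted_mean(counts, [r[key] for r in reports])
        for key in ("x", "y", "xy", "xx", "yy")
    }
    cov_xy = m["xy"] - m["x"] * m["y"]
    var_x = m["xx"] - m["x"] ** 2  # cov(X,X)
    var_y = m["yy"] - m["y"] ** 2  # cov(Y,Y)
    return cov_xy / math.sqrt(var_x * var_y)


# Two illustrative nodes holding parts of X = [1, 2, 3, 4, 5], Y = [2, 4, 7, 8, 9]
reports = [local_report([1, 2, 3], [2, 4, 7]), local_report([4, 5], [8, 9])]
print(global_correlation(reports))
```

One caveat with this form: subtracting a product of means from a mean of products can lose precision through cancellation when the means are large compared with the spread of the data, so it may be worth checking the result against a direct computation on a sample.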