Distributed algorithm for calculation of Pearson cross correlation matrix partitioned by time and key

Question

What could be an algorithm for computation of Pearson cross-correlation matrix in a distributed environment where my data is divided by id (say: 1-4) and time (say: Jan-Dec) among different nodes.

For example:

Node A({id1, Jan}, {id2, Jan}); Node B({id3, Jan}, {id4, Jan}),
Node C({id1, Feb}, {id2, Feb}); Node A({id1, March}{id2, March}),
Node C({id3, Feb}, {id4, Feb}); Node B({id3, March}, {id4, March})

Basically, I meant to say Jan data for all id is not at one node.

I'm wondering what strategy I could use where I do not have to ship large data from one node to another node as Pearson correlation is a pairwise computation. I'm ok with just transferring small intermediate result between nodes. How should I partition my data based on id and time so that I efficiently calculate cross-correlation matrix among multiple ids.

The language of choice is C++

jwimberley · Answer 1 · 2017-05-26T12:23:10.367

The correlation between two vectors of data is cor(X,Y) = cov(X,Y)/[sd(X) * sd(Y)]. Is there any way to break these up into block computations? The essential computation required (since sd(X) = sqrt(cov(X,X)) is

cov(X,Y) = <X Y> - <X> <Y>
         = 1/N (sum[i] X[i] Y[i]) - 1/N (sum[i] X[i]) * 1/N (sum[i] Y[i])

This is a sum over all indices i. Each index i, however, corresponds to a node n with N_n events and a sub-index (in that node) k_n:

cov(X,Y) = 1/N (sum[n] sum[k_n] X[k_n] Y[k_n])
         - 1/N^2 (sum[n] sum[k_n] X[k_n]) * (sum[n] sum[k_n] Y[i])

Since N = sum[n] N_n, this can be rewritten as

cov(X,Y) = (sum[n] N_n/N 1/N_n sum[k_n] X[k_n] Y[k_n])
         - (sum[n] N_n/N 1/N_n sum[k_n] X[k_n]) * (sum[n] N_n/N 1/N_n sum[k_n] Y[i])
         = (sum[n] N_n/N <XY>_n) - (sum[n] N_n/N <X>_n) * (sum[n] N_n/N <Y>_n)

So, each node need only report its number of entries N_n and the means <X>_n, <Y>_n, and <XY>_n (and, for the purposes of the correlation, <X^2>_n and <Y^2>_n) within the node. The global covariance can then be calculated via summing these means together with the appropriate weights N_n/N (where again N = sum[n] N_n) to get the global means.

Edit: LaTeX version

Since these equations are hard to parse without LaTeX, here are some more understandable image versions. The covariance of two lists of data X and Y is defined to be

where each quantity <X>, <Y>, and <XY> is a mean (of the list X, the list Y, and the pairwise product list XY). The computation of the means can be broken down as a weighted sum over the various nodes. Calling any of X, Y, XY, or X^2 or Y^2 (necessary to compute the correlation) Z, the mean of Z is:

where <Z>_k is the mean of Z on the k-th node and N_k is the number of data points in the k-th node. This reduces the amount of information needed from each node to N_k, <X>_k, <Y>_k, <XY>_k, <X^2>_k, and <Y^2>_k.

I could not understand it. May you explain a little more, maybe with some pictures. — Kumar Roshan Mehta, May 26 '17 at 09:52
@RoshanMehta This is one of the two standard forms of the covariance formula, which unfortunately is hard to write because stackoverflow mysteriously has no LaTeX support. N = sum[n] N_n is saying that the total number of events N is the sum of the events N_n in each node. Just remember, in maybe easier notation: cov(X,Y) = mean(XY) - mean(X)*mean(Y), the mean of data residing on different nodes is equal to a weighted mean of the means on each node. — jwimberley, May 26 '17 at 12:00

score 0 · Answer 2 · answered May 29 '17 at 18:34

Take a look at this article, as it could explain it a little further: https://en.wikipedia.org/wiki/Covariance_matrix

Let's take two variables X and Y that you measured, meaning that you can provide two arrays of the same length so that {x_i} are the measured values of X and {y_i} are the measured values of Y.

From a philosophical point of view, the covariance of two variables X and Y expresses how strong is the probability that to the variation of X corresponds a variation of Y.

To compute the covariance matrix you need three elements:

is the arithmetic average of X
is the arithmetic average of Y
is the arithmetic average of the element-wise product of the arrays X and Y

Provided that cov(X,Y) = cov(Y,X) and that cov(X,X) = cov(Y,Y) = 1, you can use these properties to minimize the computation required and the need for data to be transfered, for you only need to compute the elements in the upper diagonal of the matrix.

For instance, if you have two variables you have to compute only one element, for three variables you need to compute 3 elements, and so on...

score 0 · Answer 3 · answered May 30 '17 at 15:27

Will this paper help you? https://pdfs.semanticscholar.org/f02f/0df4922351375aa304de7de296393cdf7224.pdf

"The first algorithm is a parallel version of Quadrant Correlation (QC), and the second is a parallel version of the Maronna method. Parallel QC uses a parallel matrix library and can handle single-dimensional outliers in its data. The parallel Maronna method divides the independent correlation calculations between the processors and is capable of detecting one and two dimensional outliers in data."

Another similar question: Distributed cross correlation matrix computation

This paper actually sucks. Doesn't tell how, why and takes lot assumption. — Kumar Roshan Mehta, May 30 '17 at 15:29

Distributed algorithm for calculation of Pearson cross correlation matrix partitioned by time and key

3 Answers3