Distributed cross correlation matrix computation

Question

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, preferably in a distributed manner? Any suggestion for an efficient distributed algorithm would be appreciated.

Update: I read the correlation implementation in Apache Spark MLlib:

Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

But to me it looks like all the computation happens on one node, so it is not really distributed.

Please shed some light on this. I also tried running it on a 3-node Spark cluster; the screenshots are below:

As you can see from the second image, the data is pulled onto one node and the computation is done there. Am I right?

score 5 · Answer 1 · edited May 23 '17 at 11:54


To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It would also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing correlations that are robust to outliers.

edited May 23 '17 at 11:54 by Community

answered Feb 23 '17 at 01:02

dangiankit


Thanks for pointing me to James's thesis. It would be great if you could answer this too: http://stackoverflow.com/questions/42428424/how-to-calculate-mean-of-distributed-data – Kumar Roshan Mehta Feb 24 '17 at 00:44
James's thesis discusses Maronna and Quadrant covariance computation, but I could not understand these two algorithms. Do you know of any link where they are explained? – Kumar Roshan Mehta Mar 23 '17 at 16:51

score 0 · Answer 2 · answered Apr 01 '19 at 14:33


Each local data set can be converted into standard deviations and covariances. Combining the standard deviations, covariances, and sums across the local data sets then gives the correlation.

This is a working example: https://github.com/jeesim2/distributed-correlation
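To make the idea concrete, here is a minimal pure-Python sketch. It is not taken from the linked repository; the function names `partial_stats`, `merge`, and `correlation` and the toy `partitions` data are made up for illustration. It assumes each worker holds a chunk of rows, reduces it to a row count, per-column sums, and sums of cross products, and that only those small summaries are sent to be merged. Using the standard identity r_ij = (n·Σx_i·x_j − Σx_i·Σx_j) / sqrt((n·Σx_i² − (Σx_i)²)·(n·Σx_j² − (Σx_j)²)), the merged summary is enough to build the full matrix.

```python
import math
from functools import reduce

def partial_stats(rows):
    """Summarise one partition as (row count, column sums, cross-product sums)."""
    d = len(rows[0])
    sums = [0.0] * d
    prods = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                prods[i][j] += row[i] * row[j]
    return len(rows), sums, prods

def merge(a, b):
    """Combine two partition summaries by element-wise addition."""
    n_a, s_a, p_a = a
    n_b, s_b, p_b = b
    sums = [x + y for x, y in zip(s_a, s_b)]
    prods = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(p_a, p_b)]
    return n_a + n_b, sums, prods

def correlation(stats):
    """Build the Pearson correlation matrix from a merged summary."""
    n, s, p = stats
    d = len(s)
    # Unnormalised covariance: n * sum(x_i * x_j) - sum(x_i) * sum(x_j).
    # The common scale factor cancels in the ratio below.
    cov = [[n * p[i][j] - s[i] * s[j] for j in range(d)] for i in range(d)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(d)]
            for i in range(d)]

# Toy data: two "partitions" that would live on different workers.
partitions = [
    [[1.0, 2.0, 1.0], [2.0, 4.0, 3.0]],
    [[3.0, 6.0, 2.0], [4.0, 8.0, 5.0], [5.0, 10.0, 4.0]],
]
corr = correlation(reduce(merge, map(partial_stats, partitions)))
```

Because `merge` is associative and commutative, the summaries can be combined in any order, for example in a tree across nodes. One caveat: accumulating raw sums of products can lose precision when column means are large relative to the spread, so centering the data or using a pairwise update formula may be preferable for real workloads. The sketch also fails with a division by zero if a column is constant.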

answered Apr 01 '19 at 14:33

Jihun No

