
What would be the most Pythonic way to calculate a correlation matrix for extremely large vectors?

For example, I have 23 vectors, each of length 40,000 (!). The pandas.DataFrame.corr() method runs out of RAM on this.

Jenna Kwon
  • How sparse are they? Could you benefit from an approach like [this](https://stackoverflow.com/questions/19231268/correlation-coefficients-for-sparse-matrix-in-python)? If you can't benefit from a sparse representation and it's the Pearson method you need, try [this](https://stackoverflow.com/questions/3437513/finding-the-correlation-matrix). – rtkaleta Jan 19 '17 at 22:51
  • You probably have it transposed; this should be an instantaneous calculation. Assuming you want a 23x23 correlation matrix, right? In pandas, you should have 40,000 rows and 23 columns. Numpy wants it the other way (23 rows, 40,000 columns). Do it in numpy, it should be about 30x faster: `np.corrcoef(arr)` (see the sketch after these comments). – JohnE Jan 19 '17 at 23:16
  • @JohnE In pandas, I had 23 rows and 40,000 columns (should it have been the other way around? does it matter?), but you are right, np.corrcoef did this instantaneously! Thank you! Why is pandas.DataFrame.corr() so slow? – Jenna Kwon Jan 19 '17 at 23:29
  • pandas just has more overhead than numpy, but it should still be quite fast here if you transpose first. Not as fast as numpy, but either way it takes less than a second on my laptop. – JohnE Jan 20 '17 at 00:48
  • Consider using [dask](http://dask.pydata.org). – Zeugma Jan 20 '17 at 03:24
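A minimal sketch of the numpy approach suggested by JohnE, with random placeholder data standing in for the actual 23 vectors (the array contents and variable names here are assumptions for illustration):

```python
import numpy as np

# Placeholder for the 23 vectors of length 40,000; numpy expects
# one row per variable, so the array has shape (23, 40000).
rng = np.random.default_rng(0)
arr = rng.standard_normal((23, 40_000))

# 23x23 Pearson correlation matrix, computed in one shot.
corr = np.corrcoef(arr)
print(corr.shape)  # (23, 23)
```

For the pandas route from the comments, transpose first so each vector is a column: `pd.DataFrame(arr.T).corr()` should produce the same 23x23 matrix, just with more overhead.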

0 Answers