How to calculate a very large correlation matrix

Question

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).

My brute-force version of this looks like the following but is, not surprisingly, very slow:

for i1 in range(z.shape[0]):
    for i2 in range(i1+1):
        r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
        if r > 0.95:
            file.write("%6d %6d %.3f\n" % (i1,i2,r))

I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.

Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?

It's unnecessarily complicated if there's a pythonic way to do this that requires significantly less coding effort and is thus also less prone to coding errors. — Grant Petty, Sep 20 '18 at 15:25

Grant Petty · Accepted Answer · 2018-09-24T02:06:12.577

After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my MacBook), it still took a very long time (at least a day) -- presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.

In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!

For reference, the code I used was basically this:

import numpy as np
import pickle
import h5py

# read NumPy array, dimensions (102000, 60)

with open(r'file.dat', 'rb') as infile:
    x = pickle.load(infile)

# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)

# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd

# transpose back
z = ztrans.T

# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
# divide by the number of samples per row (60), not the number of rows
arr /= z.shape[1]

# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)
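
The transposes in the listing above are only needed because the row means have shape (n,) and cannot broadcast against an (n, 60) array. A minimal sketch of the same normalization using `keepdims=True`, checked against `np.corrcoef` on a small random array. The helper name `zscore_rows` and the array sizes are illustrative assumptions, not part of the original code:

```python
import numpy as np

def zscore_rows(x):
    """Z-normalize each row of x (one row per variable, one column per sample)."""
    mean = x.mean(axis=1, keepdims=True)   # shape (n, 1), broadcasts against (n, m)
    std = x.std(axis=1, keepdims=True)     # population standard deviation (ddof=0)
    return (x - mean) / std

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 60))              # small stand-in for the (102000, 60) array

z = zscore_rows(x)
corr = (z @ z.T) / z.shape[1]              # divide by the number of samples per row (60)

# True if the matrix-product result agrees with NumPy's own correlation routine
matches = np.allclose(corr, np.corrcoef(x))
```

Running a check like this on a small subset before launching the full job is a cheap way to catch a wrong normalization factor.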

user2699 · Answer 2 · 2018-09-20T17:12:50.190

By my rough calculation, the correlation matrix you want has 100,000^2 = 10^10 elements. That takes about 40 GB of memory as float32, or 80 GB as float64 (NumPy's default).
That probably won't fit in your computer's memory; otherwise you could just use corrcoef. There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category... Instead, rely on the fact that for zero-mean data the covariance can be found using a dot product.

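import numpy as np
# subtract each row's mean (one row per variable)
z0 = z - np.mean(z, 1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]   # number of samples per row

And this can be turned into the correlation by normalizing by the standard deviations:

sigma = np.std(z, 1)
corr = cov   # same array as cov; the divisions below modify it in place
corr /= sigma
corr /= sigma[:, None]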

Of course, memory usage is still an issue. You can work around this with memory-mapped arrays (make sure the file is opened for reading and writing) and the out parameter of dot. (For another example, see "Optimizing my large data code with little RAM".)

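N = z.shape[0]
# dot's out array must have the same dtype as the result, so cast to float32 first
z32 = z0.astype('float32')
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N, N))
np.dot(z32, z32.T, out=arr)
arr /= z.shape[-1]   # covariance normalization, as above
arr /= sigma
arr /= sigma[:, None]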

Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
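
The loop described above can be done a slab of rows at a time, so the boolean mask is never larger than one chunk. A rough sketch under stated assumptions: the function name `write_high_correlations`, the `rows_per_chunk` parameter, and the small demo data are made up for illustration, and the output format copies the one in the question:

```python
import os
import tempfile
import numpy as np

def write_high_correlations(arr, out_path, threshold=0.95, rows_per_chunk=1000):
    """Write 'i j r' for every pair with i > j and r > threshold; return the pair count."""
    n = arr.shape[0]
    count = 0
    with open(out_path, "w") as out:
        for start in range(0, n, rows_per_chunk):
            stop = min(start + rows_per_chunk, n)
            block = np.asarray(arr[start:stop])          # only this slab is read into RAM
            local_rows, cols = np.nonzero(block > threshold)
            global_rows = local_rows + start             # convert to full-matrix row indices
            keep = cols < global_rows                    # lower triangle, diagonal excluded
            values = block[local_rows[keep], cols[keep]]
            lines = ["%6d %6d %.3f\n" % (i, j, r)
                     for i, j, r in zip(global_rows[keep], cols[keep], values)]
            out.writelines(lines)                        # one buffered write per slab
            count += len(lines)
    return count

# Small demo: 200 variables, with one highly correlated pair planted at rows 0 and 1.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 60))
base[1] = base[0] + 0.01 * rng.normal(size=60)

tmpdir = tempfile.mkdtemp()
arr = np.memmap(os.path.join(tmpdir, "corr_memmap.dat"),
                dtype="float32", mode="w+", shape=(200, 200))
arr[:] = np.corrcoef(base)

pairs_path = os.path.join(tmpdir, "pairs.txt")
n_pairs = write_high_correlations(arr, pairs_path, rows_per_chunk=64)
```

Writing each slab's hits with a single `writelines` call avoids element-by-element writes to the output file.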

edited Sep 20 '18 at 17:12

answered Sep 20 '18 at 15:42

I don't think calculating the correlation coefficient is as simple as taking the dot product. In addition to subtracting the mean, you'd have to z-normalize the data (divide by the variance of each variable) before taking the dot product and then divide the dot product by the number of samples N. – Grant Petty Sep 20 '18 at 15:48
@GrantPetty, I was thinking of the covariance matrix. You can normalize that with an extra step to get the correlation matrix, I've updated the answer to show that. – user2699 Sep 20 '18 at 16:20
I'm a little puzzled by your use of np.load with the 'r+' options. Shouldn't I use arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N)) – Grant Petty Sep 20 '18 at 16:33
@GrantPetty, yes, that's the right way to do it. If you have an existing file the `load` command works, but you need `memmap` and the datatype and shape to create the file. Thanks for catching that. – user2699 Sep 20 '18 at 17:16
I have been trying to make the memmap approach work, but I'm running into a puzzling error. 'arr' is a memmap array with type float32, 'C' ordering, correct dimensions. But when I go to compute the dot product, I get: ValueError: output array is not acceptable (must have the right type, nr dimensions, and be a C-Array) – Grant Petty Sep 20 '18 at 17:47
The error I described above appears to have been a bug in the version of numpy I was using. When I upgraded, it went away. – Grant Petty Sep 20 '18 at 18:17
This answer (https://stackoverflow.com/questions/46008310/why-does-outputing-numpy-dot-to-memmap-does-not-work) seems to address the same issue, if upgrading isn't an option for someone else. – user2699 Sep 20 '18 at 18:40
I'm finding that the memmap approach is not obviously helping speed things up relative to my original approach -- 24 hours later, the program is still grinding away, presumably because it's hitting the disk constantly as it performs the matrix multiplication. – Grant Petty Sep 21 '18 at 13:11
It shouldn't be hitting the disk to perform the multiplications, just to store the result. Some rough tests for me show that it would still take about two hours on my relatively fast computer to calculate the covariance without the disk access, so I can easily see disk access slowing that down even more. – user2699 Sep 21 '18 at 14:34

Daniel F · Answer 3 · 2018-09-20T16:07:48.560

You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlation distances (1 - r) without the symmetric terms. Unfortunately, this will still leave you with about 5e9 terms (n(n-1)/2 for n = 100,000), roughly 40 GB as float64, which will probably exceed your memory.

You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.

Your best bet is probably brute-forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation') and then keeping only the high correlations in each block (cdist returns 1 - r, so that means distances below 0.05). Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
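
A minimal sketch of the blocked approach, assuming one row per variable as in the question. To stay dependency-free it swaps in a plain NumPy matrix product on z-normalized rows for each tile instead of calling cdist; the function name `high_correlation_pairs` and the `block_size` parameter are hypothetical, and rows with zero variance are assumed not to occur:

```python
import numpy as np

def high_correlation_pairs(x, threshold=0.95, block_size=1024):
    """Yield (i, j, r) for all row pairs i > j of x whose Pearson r exceeds threshold.

    Only one (block_size x block_size) tile of the correlation matrix exists at a time.
    """
    # z-normalize each row once; float32 is an assumption that three-decimal
    # output does not need float64 precision
    z = x - x.mean(axis=1, keepdims=True)
    z /= x.std(axis=1, keepdims=True)
    z = z.astype(np.float32)
    n, m = z.shape
    for i0 in range(0, n, block_size):
        zi = z[i0:i0 + block_size]
        for j0 in range(0, i0 + 1, block_size):        # lower-triangle tiles only
            tile = zi @ z[j0:j0 + block_size].T / m    # correlations for this tile
            ii, jj = np.nonzero(tile > threshold)
            for a, b in zip(ii, jj):
                i, j = i0 + a, j0 + b
                if j < i:                              # skip diagonal and mirrored entries
                    yield int(i), int(j), float(tile[a, b])

# Small demo: 300 variables, with row 250 built as a noisy multiple of row 7.
rng = np.random.default_rng(2)
x = rng.normal(size=(300, 60))
x[250] = 2.0 * x[7] + 0.01 * rng.normal(size=60)

pairs = list(high_correlation_pairs(x, block_size=64))
reference = np.corrcoef(x)[250, 7]
```

The index bookkeeping reduces to adding the tile offsets `i0` and `j0`, and the generator can feed a file writer directly so that only the pairs above the threshold are ever stored.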

score 0 · Answer 4 · answered May 17 '21 at 07:07

Please check out the DeepGraph package.

https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html

I tried it on z.shape = (2500, 60), computing pearsonr for all 2500 x 2500 pairs. It was extremely fast.

I'm not sure how it scales to 100000 x 100000, but it's worth trying.
