I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X
. Each row is a sample vector in a 2,000,000 dimensional space.
I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances
function with a subset of vectors in X
which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances
to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist
function.
I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist
as it requires a dense matrix X
which does not again fit in memory. I thought about looping through different chunks of X
and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?
Any help is much much appreciated. Thanks in advance.
Below is a reproducible example (of course for demonstration purposes X
is much smaller):
from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.A, 'cosine')
As you see dist2
is in the condensed form and is a 499500 dimensional vector. But dist1
is in the symmetric squareform and is a 1000x1000 matrix.