1

I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X. Each row is a sample vector in a 2,000,000 dimensional space.

I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances function with a subset of vectors in X which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist function.

I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist as it requires a dense matrix X which does not again fit in memory. I thought about looping through different chunks of X and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?

Any help is much much appreciated. Thanks in advance.

Below is a reproducible example (of course for demonstration purposes X is much smaller):

from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.A, 'cosine')

As you see dist2 is in the condensed form and is a 499500 dimensional vector. But dist1 is in the symmetric squareform and is a 1000x1000 matrix.

JRun
  • 669
  • 1
  • 10
  • 17
  • You'll need to add a concrete example; something that we can copy-n-paste and run. Obviously it won't run into the memory issues. But your verbal description is hard to follow unless we are working on exactly the same problem. I know the sparse matrix code well, but haven't worked with `sklearn`. So terminology like 'condensed form' is foreign. – hpaulj Jul 12 '16 at 19:02
  • @hpaulj It seems like everything gets asked on stackoverflow, eventually: http://stackoverflow.com/questions/13079563/how-does-condensed-distance-matrix-work-pdist – Warren Weckesser Jul 13 '16 at 02:54
  • There have also been questions about filling in a upper/lower triangle (or both) from a vector of values. – hpaulj Jul 13 '16 at 03:07
  • Searching on scikit-learn and sparse and distance turns things like http://stackoverflow.com/q/8956274/901925 – hpaulj Jul 13 '16 at 03:19
  • @hpaulj: sure. I added an example, plus some references to what the condensed form is. Condensed form is a general term used in linear algebra. For algorithms that deal with large matrices with particular structures, it is often much more efficient to represent the matrix in a condensed form using algebraic operations. There are a variety of condensed form representations, some produce block diagonal matrices using eigenvalue/eigenvectors. Here, because the pairwise distance matrix is symmetric, the simplest condensed form consists of just its upper (or lower) triangular elements. – JRun Jul 13 '16 at 18:22

1 Answers1

4

I dug into the code for both versions, and think I understand what both are doing.

Start with a small simple X (dense):

X = np.arange(9.).reshape(3,3)

pdist cosine does:

norms = _row_norms(X)
_distance_wrap.pdist_cosine_wrap(_convert_to_double(X), dm, norms)

where _row_norms is a row dot - using einsum:

norms = np.sqrt(np.einsum('ij,ij->i', X,X)

So this is the first place where X has to be an array.

I haven't dug into the cosine_wrap, but it appears to do (probably in cython)

xy = np.dot(X, X.T)
# or xy = np.einsum('ij,kj',X,X)

d = np.zeros((3,3),float)   # square receiver
d2 = []                     # condensed receiver
for i in range(3):
    for j in range(i+1,3):
         val=1-xy[i,j]/(norms[i]*norms[j])
         d2.append(val)
         d[j,i]=d[i,j]=val

print('array')
print(d)
print('condensed',np.array(d2))

from scipy.spatial import distance
d1=distance.pdist(X,'cosine')
print('    pdist',d1)

producing:

array
[[ 0.          0.11456226  0.1573452 ]
 [ 0.11456226  0.          0.00363075]
 [ 0.1573452   0.00363075  0.        ]]

condensed [ 0.11456226  0.1573452   0.00363075]
    pdist [ 0.11456226  0.1573452   0.00363075]

distance.squareform(d1) produces the same thing as my d array.

I can produce the same square array by dividing the xy dot product with the appropriate norm outer product:

dd=1-xy/(norms[:,None]*norms)
dd[range(dd.shape[0]),range(dd.shape[1])]=0 # clean up 0s

Or by normalizing X before taking dot product. This appears to be what the scikit version does.

Xnorm = X/norms[:,None]
1-np.einsum('ij,kj',Xnorm,Xnorm)

scikit has added some cython code to do faster sparse calculations (beyond those provided by sparse.sparse, but using the same csr format):

from scipy import sparse
Xc=sparse.csr_matrix(X)

# csr_row_norm - pyx of following
cnorm = Xc.multiply(Xc).sum(axis=1)
cnorm = np.sqrt(cnorm)
X1 = Xc.multiply(1/cnorm)  # dense matrix
dd = 1-X1*X1.T

To get a fast condensed form with sparse matrices I think you need to implement a fast condensed version of X1*X1.T. That means you need to understand how the sparse matrix multiplication is implemented - in c code. The scikit cython 'fast sparse' code might also give ideas.

numpy has some tri... functions which are straight forward Python code. It does not attempt to save time or space by implementing tri calculations directly. It's easier to iterate over the rectangular layout of a nd array (with shape and strides) than to do the more complex variable length steps of a triangular array. The condensed form only cuts the space and calculation steps by half.

============

Here's the main part of the c function pdist_cosine, which iterates over i and the upper j, calculating dot(x[i],y[j])/(norm[i]*norm[j]).

for (i = 0; i < m; i++) {
    for (j = i + 1; j < m; j++, dm++) {
        u = X + (n * i);
        v = X + (n * j);
        cosine = dot_product(u, v, n) / (norms[i] * norms[j]);
        if (fabs(cosine) > 1.) {
            /* Clip to correct rounding error. */
            cosine = npy_copysign(1, cosine);
        }
        *dm = 1. - cosine;
    }
}

https://github.com/scipy/scipy/blob/master/scipy/spatial/src/distance_impl.h

hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Thank you for such a comprehensive response. I have to try to understand the cython code!! Let's see... – JRun Jul 16 '16 at 01:37