I am working with large sparse binary matrices, which I have stored in compressed form using SciPy's sparse matrix implementation. The Jaccard distance from scipy.spatial.distance does not support sparse matrices directly, so the options are to either:
1. Convert the entire sparse matrix to dense and then operate on each row as a vector, which is memory hungry, or
2. Loop through the sparse matrix, grab each row using getrow(), and operate on it, or
3. Write our own implementation that works on sparse matrices.
To put things in perspective, here is the sample code:
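import scipy.spatial.distance as d
import numpy as np
from scipy.sparse import csr_matrix

# benchmark setup: a 3000 x 3000 random matrix
X = np.random.random((3000, 3000))
# binarize: entries above 0.3 become 0, the remaining nonzero entries become 1
X[X > 0.3] = 0
X[X > 0] = 1
mat = csr_matrix(X)

# query vector with three nonzero entries, stored as a 1 x 3000 sparse matrix
a = np.zeros(3000)
a[4] = a[100] = a[22] = 1
a = csr_matrix(a)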
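def jaccard_fast(v1, v2):
    # number of positions where both vectors are nonzero
    common = v1.dot(v2.T)[0, 0]
    # number of positions where exactly one vector is nonzero
    dis = (v1 != v2).getnnz()
    union = common + dis
    if union:
        return 1.0 - float(common) / float(union)
    else:
        # both vectors are empty
        return 0.0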
def benchmark_jaccard_fast():
    for i in range(mat.shape[0]):
        jaccard_fast(mat.getrow(i), a)

def benchmark_jaccard_internal_todense():
    # note: a has a single row, so zip() yields only one (v1, v2) pair here
    for v1, v2 in zip(mat.todense(), a.todense()):
        d.jaccard(v1, v2)

def benchmark_jaccard_internal_getrow():
    for i in range(mat.shape[0]):
        d.jaccard(mat.getrow(i), a)
print("Jaccard Fast:")
%time benchmark_jaccard_fast()
print("Jaccard Scipy (expanding to dense):")
%time benchmark_jaccard_internal_todense()
print("Jaccard Scipy (using getrow):")
%time benchmark_jaccard_internal_getrow()
where jaccard_fast is my own implementation. On SciPy sparse matrices it appears to be faster than the built-in function, but getrow() seems to slow it down. Benchmarking jaccard_fast against scipy.spatial.distance.jaccard gives the following results:
Jaccard Fast:
CPU times: user 1.28 s, sys: 0 ns, total: 1.28 s
Wall time: 1.28 s
Jaccard Scipy (expanding to dense):
CPU times: user 28 ms, sys: 8 ms, total: 36 ms
Wall time: 37.2 ms
Jaccard Scipy (using getrow):
CPU times: user 1.82 s, sys: 0 ns, total: 1.82 s
Wall time: 1.81 s
Any help on how to avoid the getrow() bottleneck would be appreciated. I cannot afford to expand my sparse matrix using todense() due to memory limitations.
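
For reference, the "write our own implementation" option might avoid per-row getrow() calls altogether by computing every row's intersection with the query vector in a single sparse product. The following is only a minimal sketch, assuming both inputs are binary (0/1) CSR matrices with no explicitly stored zeros. The function name jaccard_rows_vs_vector is hypothetical and not part of SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix


def jaccard_rows_vs_vector(mat, vec):
    """Jaccard distance from each row of `mat` to the single-row `vec`.

    Assumes both are binary CSR matrices without explicitly stored zeros,
    so the number of stored entries equals the number of ones.
    """
    # Intersection size for every row at once: one sparse (n x m) * (m x 1)
    # product, which yields an n x 1 result and never densifies `mat`.
    inter = mat.dot(vec.T).toarray().ravel()
    # Set sizes are the stored-entry counts per row and for the vector.
    row_sizes = mat.getnnz(axis=1)
    union = row_sizes + vec.getnnz() - inter
    # A row is at distance 0.0 from the vector when both are empty.
    dist = np.zeros(mat.shape[0])
    nonempty = union > 0
    dist[nonempty] = 1.0 - inter[nonempty] / union[nonempty]
    return dist


# Tiny demonstration on hand-made data.
demo_mat = csr_matrix(np.array([[1, 1, 0, 0],
                                [0, 0, 1, 1],
                                [1, 0, 1, 0]], dtype=float))
demo_vec = csr_matrix(np.array([[1, 0, 1, 0]], dtype=float))
print(jaccard_rows_vs_vector(demo_mat, demo_vec))
```

With the variables from the sample code, jaccard_rows_vs_vector(mat, a) would return an array with one distance per row of mat. Only an n x 1 result is converted to a dense array, so memory use should stay close to that of the sparse inputs. This sketch has not been benchmarked against the timings above.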