I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes, and even for smaller shapes the performance is not great and CPU load never really exceeds ~25%. So I'd like to make this faster and stable for larger data sets as well.
I found a really helpful resource on the speed issue; I removed all but the fastest version from the original post and added my simple sklearn implementation as "Method 2". I confirmed that on my machine (32GB RAM, WIN10, Python 3.6.4) "Method 1" takes only about 4% of the time required by "Method 2" on the data set constructed within the code. Below is the adapted code from zbinsd:
# Code adapted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1
# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T
# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))
print("Input data shape:", Asp.shape)
# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(Asp)  # note: uses the global sparse matrix Asp
# Assert that all results are consistent with the first method ("truth")
for m in range(1, 3):
    if m in [2]:  # the sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))
# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
I also found a good resource on the memory issue, and it turns out I had already taken icm's proposal into account, i.e. working with unique occurrences only, so I'm not sure how to improve further on that (a rough sketch of how I understand that idea follows right after the vectorizer snippet below).
Moving on to my original data, which comes from an sklearn CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# log_tokenizer is my own tokenizer function (not shown here)
TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
# fit the vocabulary on the unique messages only, then vectorize all messages
TF = TFvectorizer.fit_transform(unique_msgs)
all_msgs_vect = TFvectorizer.transform(all_msgs)
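The way I understand the "unique occurrences only" idea in my setting, the similarities would only ever be computed between the unique messages, and every message would just be mapped back onto them, roughly like the sketch below; msg_to_idx is a hypothetical helper for illustration, it is not part of my current code:
unique_sims = cosine_similarity(TF)                         # unique_msgs x unique_msgs only
msg_to_idx = {msg: i for i, msg in enumerate(unique_msgs)}  # hypothetical lookup
all_idx = [msg_to_idx[msg] for msg in all_msgs]             # row in unique_sims for each message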
I'm left with two problems:
Problem #1: With a small sample of my original data set, Method 1 is faster than Method 2, but neither method really uses more than ~25% of the CPU resources. (Note that the numbering here is swapped relative to the benchmark above: below, "Method 1" is sklearn's cosine_similarity and "Method 2" is zbinsd's manual approach.)
In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix
In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)
# Method 1
In [3]: start = datetime.now()
...: print(datetime.now())
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537
# Method 2
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
Any idea why, unlike in zbinsd's example, his proposed method is actually slower on my data, and how I could make use of the idling 75% of my CPU resources?
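For completeness, this is how I understand one can check which BLAS/LAPACK backend numpy was built against (a single-threaded backend would at least be consistent with the ~25% load on a quad-core machine), though I haven't confirmed that this is actually the cause:
import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries numpy was built against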
Problem #2: With a large sample of my original data I'm running into memory errors with both methods; "Method 1" never exceeds a memory load of ~20%, while "Method 2" briefly peaks at ~60% before raising the error.
In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)
In [3]: start = datetime.now()
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
...:
2019-09-09 11:13:53.808270
Traceback (most recent call last):
File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
msg_CosSim = cosine_similarity(all_msgs_vect)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):
File "<ipython-input-4-070750736bc5>", line 2, in <module>
similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
MemoryError
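Just to put these errors into perspective, here is my back-of-the-envelope estimate of the size of the full dense similarity matrix for this shape (my own arithmetic, so please correct me if it's off):
n = 1_063_867
print(n * n * 8 / 1e12, "TB")   # float64 entries -> roughly 9 TB, far beyond my 32 GB of RAM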
Any idea how I can deal with these data volumes at all, making use of all my available memory? I have a vague feeling that the .toarray() in "Method 2" is the problem, but how do I avoid it? Simply dropping it does not solve the memory issue, and I'm also not sure the matrix dot product still works properly in that case (see the traceback below and the rough chunked sketch after it):
In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):
File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
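To make it concrete, the sketch below is roughly what I picture as a chunked alternative that never materialises the full result; chunk_size is an arbitrary value and I'm not sure this is the right direction at all:
from sklearn.preprocessing import normalize
X = normalize(all_msgs_vect, norm='l2', axis=1)    # unit-length rows, stays sparse
chunk_size = 1000
for start in range(0, X.shape[0], chunk_size):
    # sparse (chunk_size x n_rows) block of cosine similarities
    block = X[start:start + chunk_size] @ X.T
    # ... process or store each block here instead of keeping the full matrix ...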
I hope I gave enough info about my original data since I can't really upload it here, but if not, please let me know! Any input is highly appreciated.
Thanks, Mark