I need to compute cosine similarity on a scipy.sparse.csr.csr_matrix. With the most straightforward sklearn implementation I run into memory errors for larger matrix shapes, and even for smaller shapes the performance is not great and CPU load never really exceeds ~25%. So I'd like to make this faster and stable for larger data sets as well.
I found a really helpful resource on the speed issue; I removed all but the fastest version from the original post and added my simple sklearn implementation as "Method 2". I confirmed that on my machine (32GB RAM, WIN10, Python 3.6.4) "Method 1" takes only about 4% of the time required by "Method 2" on the data set constructed within the code. Below is the adapted code from zbinsd:
# Code adapted from zbinsd @ https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat?rq=1
# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T
# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))
print("Input data shape:", Asp.shape)
# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        similarity = np.dot(A, A.T)
        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag
        # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0
        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)
        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 2:
        return cosine_similarity(Asp)  # note: uses the global sparse matrix Asp
# Assert that all results are consistent with the first method ("truth")
for m in range(1, 3):
    if m in [2]:  # the sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))
# Time them:
print("Method 1")
%timeit calc_sim(A, method=1)
print("Method 2")
%timeit calc_sim(A, method=2)
I also found a good resource on the memory issue, and it turns out I had already taken icm's proposal into account, i.e. working with unique occurrences only, so I'm not sure how to improve further on that (a rough sketch of how I understand that idea follows right after the vectorizer snippet below).
Moving on to my original data, which comes from an sklearn CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
# log_tokenizer is my own tokenizer function (not shown here)
TFvectorizer = CountVectorizer(lowercase=False, tokenizer=log_tokenizer, ngram_range=(1,1))
# fit the vocabulary on the unique messages only, then vectorize all messages
TF = TFvectorizer.fit_transform(unique_msgs)
all_msgs_vect = TFvectorizer.transform(all_msgs)
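The way I understand the "unique occurrences only" idea in my setting, the similarities would only ever be computed between the unique messages, and every message would just be mapped back onto them, roughly like the sketch below; msg_to_idx is a hypothetical helper for illustration, it is not part of my current code:
unique_sims = cosine_similarity(TF)                         # unique_msgs x unique_msgs only
msg_to_idx = {msg: i for i, msg in enumerate(unique_msgs)}  # hypothetical lookup
all_idx = [msg_to_idx[msg] for msg in all_msgs]             # row in unique_sims for each message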
I'm left with two problems:
Problem #1: With a small sample of my original data set, Method 1 is faster than Method 2, but neither method really uses more than ~25% of the CPU resources. (Note that the numbering here is swapped relative to the benchmark above: below, "Method 1" is sklearn's cosine_similarity and "Method 2" is zbinsd's manual approach.)
In [1]: type(all_msgs_vect)
Out[1]: scipy.sparse.csr.csr_matrix
In [2]: all_msgs_vect.shape
Out[2]: (5000, 529)
# Method 1
In [3]: start = datetime.now()
...: print(datetime.now())
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
2019-09-09 10:44:33.039660
Method 1 took 0:00:00.117537
# Method 2
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Method 2 took 0:00:08.399767
__main__:4: RuntimeWarning: divide by zero encountered in true_divide
Any idea why, unlike in zbinsd's example, his proposed method is actually slower on my data, and how I could make use of the idling 75% of my CPU resources?
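For completeness, this is how I understand one can check which BLAS/LAPACK backend numpy was built against (a single-threaded backend would at least be consistent with the ~25% load on a quad-core machine), though I haven't confirmed that this is actually the cause:
import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries numpy was built against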
Problem #2: With a large sample of my original data I'm running into memory errors with both methods; "Method 1" never exceeds a memory load of ~20%, while "Method 2" briefly peaks at ~60% before raising the error.
In [2]: all_msgs_vect.shape
Out[2]: (1063867, 3128)
In [3]: start = datetime.now()
...: msg_CosSim = cosine_similarity(all_msgs_vect)
...: print('Method 1 took', datetime.now() - start)
...:
2019-09-09 11:13:53.808270
Traceback (most recent call last):
File "<ipython-input-3-11dcc36bb82a>", line 3, in <module>
msg_CosSim = cosine_similarity(all_msgs_vect)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\metrics\pairwise.py", line 925, in cosine_similarity
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
In [4]: start = datetime.now()
...: similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
...: square_mag = np.diag(similarity)
...: inv_square_mag = 1 / square_mag
...: inv_square_mag[np.isinf(inv_square_mag)] = 0
...: inv_mag = np.sqrt(inv_square_mag)
...: cosine = similarity * inv_mag
...: msg_CosSim2 = cosine.T * inv_mag
...: print('Method 2 took', datetime.now() - start)
Traceback (most recent call last):
File "<ipython-input-4-070750736bc5>", line 2, in <module>
similarity = np.dot(all_msgs_vect.toarray(), all_msgs_vect.toarray().T)
MemoryError
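Just to put these errors into perspective, here is my back-of-the-envelope estimate of the size of the full dense similarity matrix for this shape (my own arithmetic, so please correct me if it's off):
n = 1_063_867
print(n * n * 8 / 1e12, "TB")   # float64 entries -> roughly 9 TB, far beyond my 32 GB of RAM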
Any idea how I can deal with these data volumes at all, making use of all my available memory? I have a vague feeling that the .toarray() in "Method 2" is the problem, but how do I avoid it? Simply dropping it does not solve the memory issue, and I'm also not sure the matrix dot product still works properly in that case (see the traceback below and the rough chunked sketch after it):
In [5]: similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
Traceback (most recent call last):
File "<ipython-input-5-e006c93b9bfd>", line 1, in <module>
similarity = np.dot(all_msgs_vect, all_msgs_vect.T)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\base.py", line 440, in __mul__
return self._mul_sparse_matrix(other)
File "C:\Users\her1dr\AppData\Local\conda\conda\envs\dev\lib\site-packages\scipy\sparse\compressed.py", line 502, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError
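To make it concrete, the sketch below is roughly what I picture as a chunked alternative that never materialises the full result; chunk_size is an arbitrary value and I'm not sure this is the right direction at all:
from sklearn.preprocessing import normalize
X = normalize(all_msgs_vect, norm='l2', axis=1)    # unit-length rows, stays sparse
chunk_size = 1000
for start in range(0, X.shape[0], chunk_size):
    # sparse (chunk_size x n_rows) block of cosine similarities
    block = X[start:start + chunk_size] @ X.T
    # ... process or store each block here instead of keeping the full matrix ...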
I hope I gave enough info about my original data since I can't really upload it here, but if not, please let me know! Any input is highly appreciated.
Thanks, Mark