
I'm trying to calculate the cosine similarity of a sparse matrix:

<63671x30 sparse matrix of type '<class 'numpy.uint8'>'
    with 131941 stored elements in Compressed Sparse Row format>

I used scikit-learn's cosine_similarity function, but I got this error: MemoryError: Unable to allocate 29.7 GiB for an array with shape (3984375099,) and data type float64

I googled the error and found a suggestion to increase the size of the paging file, but after doing so my PC just freezes and I have to force a shutdown and reboot. Is there any way to overcome this?

shyam
  • can you add more details about the input? Shape, how dense it is etc – Marat Jul 15 '20 at 23:58
  • @Marat it's not very dense; basically it's a vectorization of 32 classes, and the shape of the matrix is already in the post. – shyam Jul 16 '20 at 00:15
  • well, you have 130K+ items, thus billions of pairs, and sparsity of features doesn't help at all. You need to think about something else about this problem that can make this manageable. – Marat Jul 16 '20 at 00:46
  • Show the exact call, and error traceback. – hpaulj Jul 16 '20 at 00:53
  • What range of values does your matrix take? Just binary or more? Also, considering how sparse your data are, the error msg suggests that the product is much denser than expected. That seems to indicate that a small number of features is much more frequent than the others. Could you comment on that? It may help designing a solution. – Paul Panzer Jul 16 '20 at 06:00

1 Answer


Inspiration from: Link

Try computing the cosine similarity chunk-wise, i.e. take n rows at a time and calculate their cosine similarity with the whole matrix.
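
To see where batching helps and where it does not, here is a rough size check. It uses the 63671-row shape and the element count from the error message in the question; the batch size of 100 is only an assumed example value.

```python
# Numbers taken from the question.
n_rows = 63671                  # rows in the sparse matrix
failed_elements = 3984375099    # array length reported in the MemoryError
bytes_per_float64 = 8

# Assumed example value, not something the question specifies.
batch_size = 100

# Size of the allocation that failed.
failed_gib = failed_elements * bytes_per_float64 / 2**30

# Size of a dense float64 array holding every pairwise similarity.
full_gib = n_rows * n_rows * bytes_per_float64 / 2**30

# Size of the dense block produced for one batch of rows.
batch_mib = batch_size * n_rows * bytes_per_float64 / 2**20

print(f"failed allocation: {failed_gib:.1f} GiB")
print(f"full dense result: {full_gib:.1f} GiB")
print(f"one batch of {batch_size} rows: {batch_mib:.1f} MiB")
```

Each batch is only tens of MiB, but a dense array holding every pair still needs roughly 30 GiB, about as much as the allocation that failed.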

from scipy import sparse
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def cosine_similarity_n_space(m1, m2, batch_size=100):
    """Cosine similarity of each row of m1 with each row of m2, in batches."""
    assert m1.shape[1] == m2.shape[1] and isinstance(batch_size, int)

    # Dense output buffer; it still holds m1.shape[0] * m2.shape[0] float64 values.
    ret = np.empty((m1.shape[0], m2.shape[0]))

    # Walk through m1 in slices of batch_size rows; the last slice may be shorter.
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        rows = m1[start:end]
        ret[start:end] = cosine_similarity(rows, m2)

    return ret


A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
chunk_wise_similarity = cosine_similarity_n_space(A_sparse, A_sparse)

# Compare with a floating-point tolerance instead of exact equality.
equal_arrays = np.allclose(similarities, chunk_wise_similarity)

print(equal_arrays)
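
If the full dense result is too large to hold, one possible variation (a sketch, not something scikit-learn provides) is to keep only the similarities at or above some cutoff from each batch and store them sparsely. The function name cosine_similarity_thresholded and its threshold parameter are hypothetical, and whether the result fits in memory depends on how many pairs pass the cutoff.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity


def cosine_similarity_thresholded(m1, m2, batch_size=100, threshold=0.5):
    """Batched cosine similarity that keeps only values >= threshold.

    Hypothetical helper for illustration; the name and the threshold
    parameter are assumptions, not part of scikit-learn.
    """
    blocks = []
    for start in range(0, m1.shape[0], batch_size):
        # Dense block of shape (rows in this batch, m2.shape[0]).
        sim = cosine_similarity(m1[start:start + batch_size], m2)
        # Zero out weak similarities so they are not stored.
        sim[sim < threshold] = 0.0
        # Converting to CSR keeps only the remaining non-zero entries.
        blocks.append(sparse.csr_matrix(sim))
    # Stack the per-batch blocks into one sparse result.
    return sparse.vstack(blocks, format="csr")


A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

result = cosine_similarity_thresholded(A_sparse, A_sparse, batch_size=2, threshold=0.5)
print(result.toarray())
```

In this small example only the diagonal survives the 0.5 cutoff, since every off-diagonal similarity is below it.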
Kartikey Singh