
Hope everyone's well. I'm trying to use the following method to efficiently calculate the cosine similarity of a (29805, 40) sparse matrix, created by running sklearn's HashingVectorizer on my dataset. The method below is originally from @Waylon Flinn's answer to this question.

def cosine_sim(A):

    similarity = np.dot(A, A.T)

    # squared magnitude of preference vectors (number of occurrences)
    square_mag = np.diag(similarity)

    # inverse squared magnitude
    inv_square_mag = 1 / square_mag

    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    inv_square_mag[np.isinf(inv_square_mag)] = 0

    # inverse of the magnitude
    inv_mag = np.sqrt(inv_square_mag)

    # cosine similarity (elementwise multiply by inverse magnitudes)
    cosine = similarity * inv_mag
    return cosine.T * inv_mag

When I try with a dummy matrix, everything works fine.

A = np.random.randint(0, 2, (10000, 100)).astype(float)
cos_sim = cosine_sim(A)

But when I try with my own matrix:

cos_sim = cosine_sim(sparse_matrix)

I get

ValueError: Input must be 1- or 2-d.

Now, calling .shape on my matrix returns (29805, 40). How is that not 2-d? Can someone tell me what I'm doing wrong here? The error occurs here (from jupyter notebook traceback):

----> 6     square_mag = np.diag(similarity)

Thanks for reading! For context, evaluating sparse_matrix displays this:

<29805x40 sparse matrix of type '<class 'numpy.float64'>'
with 1091384 stored elements in Compressed Sparse Row format> 
`numpy` functions (and others) are not 'sparse' aware. Often they try to convert the inputs to arrays with `np.asarray(sparse_matrix)`; try that yourself with your matrix. – hpaulj Jun 16 '20 at 17:12

2 Answers


np.diag starts with

 v = asanyarray(v)

similarity = np.dot(A, A.T) works with A sparse, because it delegates the action to the sparse matrix multiplication. The result will be a sparse matrix - you can check that yourself.

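But then try passing that result to np.asanyarray. It is not converted to a 2-d numeric array. NumPy instead wraps the whole sparse matrix in a 0-d object array, which is why np.diag complains that the input is not 1- or 2-d.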
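To see this for yourself, here is a minimal sketch on a small random matrix. `S` and `G` are made-up stand-ins for your data, and I use `@` rather than `np.dot` for the product so the example does not depend on how `np.dot` dispatches to sparse types.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real data (hypothetical; not the asker's matrix).
S = sparse.random(5, 3, density=0.5, format="csr", random_state=0)

# Sparse-aware product: the result is still a sparse matrix.
G = S @ S.T
print(sparse.issparse(G))

# NumPy does not unpack the sparse matrix; inspect what it produces instead.
wrapped = np.asanyarray(G)
print(wrapped.shape, wrapped.dtype)

# np.diag goes through the same conversion, hence the error in the question.
try:
    np.diag(G)
except ValueError as exc:
    print("np.diag failed:", exc)

# The sparse matrix's own method returns the diagonal as a 1-d ndarray.
print(G.diagonal())
```

If you only need the diagonal, `G.diagonal()` avoids converting the whole product to a dense array.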

hpaulj

Okay, while typing out the question, I tried converting to an ndarray object and it worked. I'm still posting the question and the answer in case it helps someone else. Cheers!

Solution:

cos_sim = cosine_sim(sparse_matrix.A)
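For anyone who would rather keep the input sparse, here is a rough sketch of an alternative that normalizes the rows first and only densifies the final result. The name `cosine_sim_sparse` is my own, not something from the linked answer, and I've only reasoned it through on small inputs. Note that `.A` is shorthand for `.toarray()`, which is what I use below.

```python
import numpy as np
from scipy import sparse


def cosine_sim_sparse(A):
    """Cosine similarity between the rows of A (sparse or dense input)."""
    A = sparse.csr_matrix(A)

    # Squared L2 norm of every row, flattened to a plain 1-d array.
    sq_norms = np.asarray(A.multiply(A).sum(axis=1)).ravel()

    # Inverse norms; all-zero rows keep a factor of 0 instead of inf.
    inv_norms = np.zeros_like(sq_norms, dtype=float)
    nonzero = sq_norms > 0
    inv_norms[nonzero] = 1.0 / np.sqrt(sq_norms[nonzero])

    # Scale each row to unit length, staying sparse.
    A_unit = sparse.diags(inv_norms) @ A

    # Dot products of unit rows are the cosine similarities.
    return (A_unit @ A_unit.T).toarray()


# Same kind of dummy data as in the question, just smaller.
A = np.random.randint(0, 2, (1000, 100)).astype(float)
cos_sim = cosine_sim_sparse(sparse.csr_matrix(A))
print(cos_sim.shape)
```

Either way the output is a dense n-by-n array, so for 29805 rows expect roughly 7 GB as float64. Keeping the input sparse only saves work on the way there.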