Fastest way to compute cosine similarity in python

Question

I am trying to compute the cosine similarity between TFIDF vector representations of documents (there are 500 documents in the MySQL database) and TFIDF vector representation of the user query. Initially, I had written my own code to perform this computation (My code is commented in the snippet). This took more than 4.8 seconds on an average to perform the computation. Then on searching for ways to reduce the computation time I tried to use numpy library for this. However, now also it is taking 1.3 seconds to perform the computation.

def csim(dtv,qv):
    csim=[]
    b = np.array(qv)
    for i in range(len(dtv)):
        # numerator= np.dot(dtv[i],qv)
        # denominator = np.sqrt(np.sum(np.power(dtv[i],2)) * np.sum(np.power(qv,2)))
        # csim.append(((numerator / (denominator + 1e-9)), i + 1))
        a = np.array(dtv[i])
        numerator = np.dot(a, b)
        denominator = np.linalg.norm(a)*np.linalg.norm(b)
        csim.append(((numerator / (denominator + 1e-9)), i + 1))
    return csim

Here dtv is a list of lists of document vectors and qv is vector representation of the query vector in the form of a list. How can I reduce the computation time of cosine similarity?

Here is one reference I found but I could not understand how the cosine similarity was being computed.

I am looking to reduce the computation time in the order of milliseconds range.

Here is the TFIDF vector representation of documents 1. Here is the TFIDF vector representation of a query.

Could you please add examples of the `ctv` and `qv` variables? — Lith, Jun 17 '21 at 12:23
@Lith I have added examples of [dtv](https://pastebin.com/pZwEgxU5) and [qv](https://pastebin.com/T8p0Lm7K) in my post. — Arup Das, Jun 21 '21 at 06:54
Using a GPU for the computation might further increase the speed. — NoDataDumpNoContribution, Jun 21 '21 at 06:55
[This](https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat) may be helpful. — hilberts_drinking_problem, Jun 21 '21 at 06:59
Does this answer your question? [What's the fastest way in Python to calculate cosine similarity given sparse matrix data?](https://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat) — Lith, Jun 22 '21 at 16:46
@Lith I ran it over 500 documents and it took close to 1,2 seconds. So there has not been much reduction in time. It would be really helpful if I could bring it down to milliseconds range. — Arup Das, Jun 23 '21 at 05:31

score 0 · Answer 1 · answered Sep 19 '21 at 20:38

One way to compute the cosine similarities between two batches of vectors would be to first create Numpy matrixes for each of the batch of vectors, each one of shape (n_vectors, vector_size), like this:

X = np.array(dtv) # dtv is a list of vectors, shape=(len(dtv), len(dtv[0]))
Y = np.array([b]) # b is a single list (one vector), shape(1, len(b))

Then compute the equation for cosine similarity cos = X . Y / ( ||X||*||Y|| ) in a batch form like this:

numerator = np.matmul(X, Y.transpose())
norms_X = np.sqrt((X * X).sum(axis=1))  # Or np.linalg.norm(axis=1)
norms_Y = np.sqrt((Y * Y).sum(axis=1))
denominator = np.matmul(norms_X.reshape(-1, 1), norms_Y.reshape(1, -1))
cs = np.divide(numerator, denominator)

Looking at the implementation in sklearn they do it differently, first normalizing the vectors to unit length and then doing a dot product of the matrixes of shape (n_vectors, vector_size), and according to my tests it's about twice as fast than the above implementation, here a summary of the sklearn implementation:

norms_X = np.sqrt((X * X).sum(axis=1)) # Or np.linalg.norm(axis=1)
X /= norms_X[:, np.newaxis]
norms_Y = np.sqrt((Y * Y).sum(axis=1))
Y /= norms_Y[:, np.newaxis]
cs = np.dot(X, Y.T)

Fastest way to compute cosine similarity in python

1 Answers1