I am trying to compute the cosine similarity between the TFIDF vectors of my documents (there are 500 documents in a MySQL database) and the TFIDF vector of the user query. Initially I wrote my own code for this computation (it is commented out in the snippet below); it took more than 4.8 seconds on average. While searching for ways to reduce the computation time, I tried the numpy library. However, it still takes 1.3 seconds to perform the computation.
import numpy as np

def csim(dtv, qv):
    csim = []
    b = np.array(qv)
    for i in range(len(dtv)):
        # Original version (about 4.8 s), kept for reference:
        # numerator= np.dot(dtv[i],qv)
        # denominator = np.sqrt(np.sum(np.power(dtv[i],2)) * np.sum(np.power(qv,2)))
        # csim.append(((numerator / (denominator + 1e-9)), i + 1))
        a = np.array(dtv[i])
        numerator = np.dot(a, b)
        denominator = np.linalg.norm(a) * np.linalg.norm(b)
        # Store (similarity, 1-based document index); 1e-9 avoids division by zero
        csim.append(((numerator / (denominator + 1e-9)), i + 1))
    return csim
Here dtv is a list of lists, one TFIDF vector per document, and qv is the query vector, also as a list. How can I reduce the computation time of the cosine similarity?
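For comparison, here is a minimal sketch of a variant without the Python loop. It assumes every document vector has the same length as the query vector, so that dtv can be converted to a single 2-D array. The name csim_vectorized is hypothetical, and the sketch has not been timed on the data above.

```python
import numpy as np

def csim_vectorized(dtv, qv):
    """Cosine similarity of each document vector against the query vector."""
    # Assumption: dtv is rectangular (all rows have len(qv) entries).
    docs = np.asarray(dtv, dtype=float)    # shape (n_docs, n_terms)
    query = np.asarray(qv, dtype=float)    # shape (n_terms,)

    # One matrix-vector product gives every dot product at once.
    numerators = docs @ query
    # Row-wise norms of the documents times the norm of the query.
    denominators = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    scores = numerators / (denominators + 1e-9)

    # Same return shape as csim: (similarity, 1-based document index).
    return list(zip(scores.tolist(), range(1, len(scores) + 1)))
```

Since the documents do not change between queries, the conversion of dtv to an array and the row norms could probably be computed once and reused, leaving only the matrix-vector product per query.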
Here is one reference I found, but I could not understand how the cosine similarity was being computed there.
I am looking to bring the computation time down to the order of milliseconds.
Here is the TFIDF vector representation of documents 1. Here is the TFIDF vector representation of a query.