
I have a bank of about 100k strings and when I get a new string, I want to match it to the most similar string.

My thoughts were to use tf-idf (makes sense as keywords are quite important), then match using the cosine distance. Is there an efficient way to do this using pandas/scikit-learn/scipy etc? I'm currently doing this:

df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)

which is obviously quite slow. I was thinking of maybe a KD-tree, but it takes a lot of memory as the tf-idf vectors have a dimension of 2000.
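
For context, something like this is how the tf-idf vectors get produced (a simplified sketch, not my exact code; the use of TfidfVectorizer with max_features=2000 and the names bank_strings / new_string are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# Fit tf-idf on the bank of ~100k strings; each row is one string's 2000-dimensional vector.
vectorizer = TfidfVectorizer(max_features=2000)
bank_matrix = vectorizer.fit_transform(bank_strings)  # sparse, shape (100000, 2000)

# An incoming string is transformed with the same fitted vocabulary before comparison.
new_vec = vectorizer.transform([new_string])  # sparse, shape (1, 2000)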


1 Answer


Consider using vectorized computations rather than looping over DataFrame rows (which is very slow and should be avoided).

I'm not sure how the arrays are represented in the DataFrame, so make sure you start out with two arrays of the same shape (one row per string).

import numpy as np
from numpy.linalg import norm

# Stack the per-row vectors into two dense 2-D arrays of the same shape (n_rows, n_features).
arr_a = np.vstack(df["tf-idf"].values)
arr_b = np.vstack(df["new_string"].values)

# Row-wise dot products via einsum, divided by the product of the row norms, give the cosine similarity.
cos_sim = np.einsum('ij,ij->i', arr_a, arr_b) / (norm(arr_a, axis=1) * norm(arr_b, axis=1))
df["cosine_distance"] = 1 - cos_sim

This calculates the cosine distances directly with vectorized array operations (see the numpy.einsum documentation for the row-wise dot product) and will run orders of magnitude faster than the DataFrame.apply() approach.

  • Thank you! The arrays are stored as sparse vectors, e.g. [<1x2000 sparse matrix of type '' with 0 stored elements in Compressed Sparse Row format>, ..., ] Should they be converted to dense matrices first? Or can numpy deal with them as they are? – user112633 Apr 06 '22 at 12:26
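
Regarding the comment: there should be no need to convert to dense matrices. One option (a sketch, assuming each cell really holds a 1x2000 CSR row as described) is to stack the rows into a single sparse matrix with scipy.sparse.vstack and use scikit-learn's paired_cosine_distances, which accepts sparse input:

from scipy.sparse import vstack
from sklearn.metrics.pairwise import paired_cosine_distances

# Stack the 1x2000 CSR rows into (n_rows x 2000) sparse matrices -- no densifying needed.
tfidf_matrix = vstack(df["tf-idf"].tolist())
new_matrix = vstack(df["new_string"].tolist())

# Row-wise cosine distances computed on the sparse matrices in a single vectorized call.
df["cosine_distance"] = paired_cosine_distances(tfidf_matrix, new_matrix)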