I have a bank of about 100k strings, and when I get a new string I want to match it to the most similar string in the bank.
My plan is to use tf-idf (which makes sense, as keywords are quite important) and then match using cosine distance. Is there an efficient way to do this with pandas/scikit-learn/scipy etc.? I'm currently doing this:
from sklearn.metrics.pairwise import cosine_distances
df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)
which is obviously quite slow, since it calls cosine_distances once per row. I was thinking of maybe using a KD-tree, but it takes a lot of memory because the tf-idf vectors have 2000 dimensions.
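
For reference, here is a minimal sketch of the vectorized version I have in mind. It assumes the whole bank fits in one sparse tf-idf matrix and that the default TfidfVectorizer settings are acceptable. The `bank` list and the `most_similar` helper are hypothetical names made up for illustration, not part of my real code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in for the real bank of ~100k strings.
bank = [
    "red apple pie recipe",
    "blue whale migration patterns",
    "green tea health benefits",
]

# Fit once on the bank; the result is a sparse (n_strings, n_terms) matrix.
vectorizer = TfidfVectorizer()
bank_tfidf = vectorizer.fit_transform(bank)

def most_similar(new_string):
    """Return (index, similarity) of the bank string closest to new_string."""
    query = vectorizer.transform([new_string])
    # TfidfVectorizer L2-normalises rows by default, so a plain dot
    # product against the bank matrix equals cosine similarity.
    sims = linear_kernel(query, bank_tfidf).ravel()
    best = int(np.argmax(sims))
    return best, float(sims[best])

best_idx, best_sim = most_similar("apple pie")
print(bank[best_idx], best_sim)
```

This replaces the per-row apply with a single sparse matrix product per query. Whether that is fast enough at 100k strings, or whether an index structure is still needed, is the part I am unsure about.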