0

I am working on a problem that needs similarity metrics to extract a subset of data from a larger set for further analysis.

The way I am extracting the subset is by using cosine similarity above certain threshold. The toy set below describes the problem:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.normal(0, 1, (10, 10)))

the similarity of the 10 "samples" by 10 "features" is given by this matrix:

similarity_df = pd.DataFrame(cosine_similarity(df))

(as an example, the comparison heatmap looks like this): enter image description here Given a new vector:

new_vector = np.array([1.0, 0.49, -0.05, 0.36, -1.0, -1.8, 0.21, 0.31, 0, -1])

I find similarity using this function:

def find_similarity(df, search):
    rankings = []
    for i in range(len(df)):
        sim = int(100*cosine_similarity(search.reshape(1,-1), df.iloc[i].values.reshape(1,-1))[0][0])
        rankings.append((i, sim))
    df= pd.DataFrame(rankings)
    df.columns = ['item', 'similarity']
    df.sort_values(by='similarity', inplace=True, ascending=False)
    df.reset_index(drop=True, inplace=True)
    return df

Then return the items most similar, above a threshold:

r = find_similarity(df, new_vector)
r[r['similarity']>50]

This works fine, the problem I have is that my data set has over 1 million rows by 200 columns, and the similarity_df is 1 million by 1 million. Needless to say, the sequential comparison in my for loop is slow.

for i in range(len(df)):
    sim = int(100*cosine_similarity(search.reshape(1,-1), df.iloc[i].values.reshape(1,-1))[0][0])

This is compounded by the fact that I need to find similarity for about 100 vectors at a time.

Is there a faster way to do this search/ranking? Maybe a hashing function that maximizes similarity (as opposed to minimizing collusion), such as LSH? Although I am using cosine similarity, maybe another similarity metric does a better job, but for now the problem is a faster way to do the one to many comparison.

Luis Miguel
  • 5,057
  • 8
  • 42
  • 75
  • 1
    Could you use [`cdist(XA, XB, 'cosine')`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html)? – Divakar Sep 05 '19 at 15:01
  • @Divakar, thank you for the feedback. This looks very promising, I did not know this function..I'll play with it later – Luis Miguel Sep 05 '19 at 15:10
  • @Divakar..I suggest you move your suggestion to answer so I can accept it. It worked and it is extremely fast (there are some small mods in between). Thanks again! – Luis Miguel Sep 05 '19 at 15:37
  • 1
    Those timings in the second linked Q&A's answer might interest readers looking for performance. – Divakar Sep 05 '19 at 17:58

0 Answers0