I'm working on an NLP project where I have to compare the similarity between many sentences, e.g. from this dataframe:
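For context, the dataframe has one row per sentence: the question text plus its embedding vector. A minimal sketch of the structure (column names taken from my code below; the vectors here are just dummy values standing in for the real sentence embeddings):

import numpy as np
import pandas as pd

df_sample = pd.DataFrame({
    "questions": ["How do I reset my password?",
                  "How can I change my password?",
                  "Where is the billing page?"],
    # dummy 3-d vectors in place of the real embedding vectors
    "use_vector": [np.array([0.10, 0.30, 0.60]),
                   np.array([0.10, 0.35, 0.55]),
                   np.array([0.90, 0.05, 0.05])],
})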
The first thing I tried was to join the dataframe with itself to get the below format and compare row by row:
The problem with this is that I run out of memory quickly for medium/big datasets, e.g. joining 10k rows gives 100M rows, which I cannot fit in RAM.
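The join was essentially a cross join of the dataframe with itself, roughly like this (a sketch; how="cross" requires pandas >= 1.2):

# every question paired with every other question: n rows -> n*n rows,
# which is why 10k rows explodes into 100 million pairs
pairs = df_sample.merge(df_sample, how="cross", suffixes=("_a", "_b"))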
My current approach is to iterate over the dataframe as follows:
import copy
import pandas as pd

final = pd.DataFrame()
### for each row
for i in range(len(df_sample)):
    ### select the corresponding vector to compare with
    v = df_sample[df_sample.index.isin([i])]["use_vector"].values
    ### compare all cases against the selected vector
    ### (cosine_similarity_numba is my numba-compiled cosine similarity helper)
    sims = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v[0]), axis=1)
    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[sims > 0.6]
    ### filter out the base case
    temp = temp[~temp.index.isin([i])]
    temp["original_question"] = copy.copy(df_sample[df_sample.index.isin([i])]["questions"].values[0])
    ### append the result
    final = pd.concat([final, temp])
But this approach is not fast either. How can I improve the performance of this process?