I have the following loop, which calls a 'similarity' function for each row and stores its output in a 'results' variable. The result is then formatted by the 'similarity_formated' function and written to the 'SIMILARITY_ALL' column of the dataframe at index position i. However, because 'similarity' itself iterates over the entire dataframe on every call, the loop takes a long time to run.
    for i,row in df_teste.iterrows():
        results = similarity(embeddings, df_acordaos,i,0)
        df_teste.at[i,'SIMILARITY_ALL']=similarity_formated(results)
        print(np.array(results[['process_number','SIMILARITY']]).shape)
The 'similarity' function is as follows:
    def similarity(tfidfs, df, df_idx,classes,expected_similarity=-1):
        df_=df.copy()
        df_=df_.reset_index().rename(columns={'index': 'df_index'})
        col = df_[df_['df_index']==df_idx]
        idx=col.index[0]
        df_['SIMILARITY']=0
        for indice in df_.index:
            value_similarity = cosine_similarity(tfidfs[idx],tfidfs[indice])[0][0]
            if value_similarity >= expected_similarity:
                df_.at[indice,'SIMILARITY']=value_similarity
        df_ = df_.sort_values('SIMILARITY',ascending=False,inplace=False)
        return df_
And the 'similarity_formated' function is:
    def similarity_formated(results):
        processes=list(results['process_number'])
        similaritys= list(results['SIMILARITY'])
        text=''
        for p,s in zip(processes, similaritys):
            text+=p+' '+"{:.3f}".format(s)+' '
        return text
I tried several ways to parallelize this loop, but I still haven't found a suitable way.
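For context, the inner loop in 'similarity' computes one cosine similarity per pair of rows, so the whole job is an all-pairs computation. A minimal sketch of a vectorized alternative is given here, under assumptions that are not part of the code above: `embeddings` is a dense 2-D array with one row per document, row i of the array matches positional row i of the dataframe, and the `expected_similarity` threshold is left at its default so nothing is filtered out. The helper names `all_pairs_cosine` and `format_row` and the toy data are hypothetical. With scikit-learn, calling `cosine_similarity(embeddings)` with a single argument should return the same full matrix and also accepts sparse input.

```python
import numpy as np

def all_pairs_cosine(embeddings):
    """Return the (n, n) matrix of cosine similarities between all rows."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    X_unit = X / norms       # each row now has unit length
    return X_unit @ X_unit.T # dot products of unit vectors = cosine similarity

def format_row(sim_row, process_numbers):
    """Sort one row of similarities (descending) and format it
    the same way as similarity_formated does."""
    order = np.argsort(-sim_row, kind="stable")
    return ''.join(f"{process_numbers[j]} {sim_row[j]:.3f} " for j in order)

# Hypothetical toy data standing in for the real embeddings and process numbers
embeddings = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
process_numbers = ["A", "B", "C"]

sims = all_pairs_cosine(embeddings)  # one matrix product instead of n*n calls
formatted = [format_row(sims[i], process_numbers)
             for i in range(len(process_numbers))]
print(formatted[0])
```

Each entry of `formatted` could then be assigned to the 'SIMILARITY_ALL' column in one step. The full matrix needs memory proportional to n squared, so for a large dataframe it may be necessary to process the rows in blocks.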