I have the following loop, which calls a 'similarity' function for each row and stores the output in a 'results' variable. The result, formatted by the 'similarity_formated' function, is then written to a new column of the dataframe at index position i. However, because 'similarity' iterates over the entire dataframe on every call, the loop takes a long time to run.

for i,row in df_teste.iterrows():
    results = similarity(embeddings, df_acordaos,i,0)
    df_teste.at[i,'SIMILARITY_ALL']=similarity_formated(results)
    print(np.array(results[['process_number','SIMILARITY']]).shape) 

The 'similarity' function is as follows:

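# Assumed import: cosine_similarity is presumably scikit-learn's
from sklearn.metrics.pairwise import cosine_similarity

def similarity(tfidfs, df, df_idx,classes,expected_similarity=-1):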
    df_=df.copy()
    df_=df_.reset_index().rename(columns={'index': 'df_index'})
    col = df_[df_['df_index']==df_idx]
    idx=col.index[0]
    
    df_['SIMILARITY']=0.0  # float column, since the similarities are floats
    
    for indice in df_.index:
        value_similarity = cosine_similarity(tfidfs[idx],tfidfs[indice])[0][0]
        if value_similarity >= expected_similarity:
            df_.at[indice,'SIMILARITY']=value_similarity
    df_ = df_.sort_values('SIMILARITY',ascending=False,inplace=False)
    return df_
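
The inner loop above computes one pairwise similarity per call, which is usually the main cost. A minimal sketch of computing one row against all rows in a single vectorized step follows. It assumes the embeddings are a dense 2-D NumPy array and it swaps in plain NumPy for cosine_similarity; the name similarity_to_all is hypothetical, not part of the original code.

```python
import numpy as np

def similarity_to_all(embeddings, idx):
    """Cosine similarity between row `idx` and every row of a dense 2-D array.

    Hypothetical helper; assumes `embeddings` has shape (n_rows, n_features).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(embeddings, axis=1)
    # Avoid division by zero for all-zero rows (their similarity stays 0).
    norms[norms == 0] = 1.0
    unit = embeddings / norms[:, None]
    # One matrix-vector product replaces the Python-level loop.
    return unit @ unit[idx]

demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = similarity_to_all(demo, 0)
order = np.argsort(-scores)  # row positions sorted by descending similarity
```

If the embeddings are a SciPy sparse matrix, scikit-learn's cosine_similarity also accepts a one-row first argument and the full matrix as the second, so a single call should cover the same ground. Computing every row against every row at once would be `unit @ unit.T`, at the cost of an n-by-n result in memory.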

And the 'similarity_formated' function is:

def similarity_formated(results):
    processes=list(results['process_number'])
    similaritys= list(results['SIMILARITY'])
    text=''
    for p,s in zip(processes, similaritys):
        text+=p+' '+"{:.3f}".format(s)+' '
    return text
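
Repeated string concatenation in a loop can also get slow for long result lists. A possible alternative, sketched here under the assumption that only the two columns are needed, builds the text with a single join; the name format_similarities is hypothetical.

```python
def format_similarities(process_numbers, similarities):
    """Join 'process_number similarity' pairs into one space-separated string.

    Hypothetical helper; similarities are formatted to three decimals.
    """
    return ' '.join(
        f"{p} {s:.3f}" for p, s in zip(process_numbers, similarities)
    )

example = format_similarities(['A-1', 'B-2'], [0.98765, 0.5])
print(example)  # A-1 0.988 B-2 0.500
```

Unlike the original function, this version leaves no trailing space at the end of the string, and it also works when the process numbers are not strings.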

I tried several ways to parallelize this loop, but I still haven't found a suitable way.

  • My 2 cents: Before trying to parallelize something here, you should optimize your code. My impression is that you don't use Pandas as it's supposed to be used, and are therefore wasting a lot of performance. You should strip down your question to the essential steps, and provide a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) (also look [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)). – Timus Sep 23 '22 at 10:43
