
I'm trying to count, for every row in Dataframe 1, how many words it has in common with each row in Dataframe 2.

Based on the similarities I want to create a new data frame where:

columns = the N rows of dataframe2
values = similarity (the number of shared words).

My current code works, but it runs very slowly and I'm not sure how to optimize it...

import pandas as pd

df = pd.DataFrame([])

for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]

    # count the terms shared with every row of the second dataframe
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(terms_2.intersection(terms_1))
        save[data2['code'].iloc[y]] = similar_n

    # append one row per iteration (this copies the whole frame each time)
    df = df.append(pd.DataFrame([save]))
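
A big part of the slowness, independent of the similarity logic itself, is df.append() inside the loop: every call copies the entire frame, so the cost grows quadratically with the number of rows. Below is a minimal sketch of the same computation that collects plain dicts in a list and builds the DataFrame once at the end, assuming (as in the code above) that text_tokenized and terms hold Python sets:

rows = []
for x in range(len(data)):
    save = {'code': data['code'].iloc[x]}
    terms_1 = data['text_tokenized'].iloc[x]
    for y in range(len(data2)):
        # same set-intersection count as the inner loop above
        save[data2['code'].iloc[y]] = len(data2['terms'].iloc[y] & terms_1)
    rows.append(save)

df = pd.DataFrame(rows)

This does the same number of set intersections but avoids the repeated DataFrame copies.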

Update: new code (still running slow)

def get_sim(x, terms):
    # size of the intersection between one tokenized text and one set of terms
    return len(x.intersection(terms))

# one full apply() pass over the first dataframe for every code in the second one
for row in icd10_terms.itertuples():
    code, terms = row[1], row[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
  • Your problem is that you are using two nested iteration loops, and that is really slow with pandas. A vectorized solution would be the best option, but it might be hard to find for this case. I would try something with the apply() function. Have a look at this question: https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316 – m33n Apr 26 '18 at 11:02
  • I was also thinking about using apply(), but I don't know how to implement it in this case with 2 different dataframes – Rick Bruins Apr 26 '18 at 11:25
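
Following the apply() idea from the comments, one possible sketch for working across the two dataframes is to precompute the codes and term sets of the second frame once, and have the applied function return a whole Series of intersection counts, so pandas assembles all the columns in a single pass. The column names are taken from the first code block above, and it again assumes both columns hold Python sets:

codes = data2['code'].tolist()
term_sets = data2['terms'].tolist()

def sims(tokens):
    # one result row: intersection size against every set of terms in data2
    return pd.Series({c: len(tokens & t) for c, t in zip(codes, term_sets)})

result = data['text_tokenized'].apply(sims)
result.insert(0, 'code', data['code'].values)

Compared with the update above, this calls apply() once instead of once per code and builds all the columns in one go rather than inserting them into data one at a time.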
