
I'm trying to count, for every row in Dataframe 1, how many words it has in common with each row in Dataframe 2.

Based on the similarities I want to create a new data frame where:

columns = the N rows of dataframe2
values = similarity (the number of shared words).

My current code works, but it runs very slowly and I'm not sure how to optimize it...

import pandas as pd

df = pd.DataFrame([])

for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]

    # count the terms shared with every row of the second dataframe
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(terms_2.intersection(terms_1))
        save[data2['code'].iloc[y]] = similar_n

    # append one row per iteration (this copies the whole frame each time)
    df = df.append(pd.DataFrame([save]))
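
A big part of the slowness, independent of the similarity logic itself, is df.append() inside the loop: every call copies the entire frame, so the cost grows quadratically with the number of rows. Below is a minimal sketch of the same computation that collects plain dicts in a list and builds the DataFrame once at the end, assuming (as in the code above) that text_tokenized and terms hold Python sets:

rows = []
for x in range(len(data)):
    save = {'code': data['code'].iloc[x]}
    terms_1 = data['text_tokenized'].iloc[x]
    for y in range(len(data2)):
        # same set-intersection count as the inner loop above
        save[data2['code'].iloc[y]] = len(data2['terms'].iloc[y] & terms_1)
    rows.append(save)

df = pd.DataFrame(rows)

This does the same number of set intersections but avoids the repeated DataFrame copies.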

Update: new code (still running slow)

def get_sim(x, terms):
    # size of the intersection between one tokenized text and one set of terms
    return len(x.intersection(terms))

# one full apply() pass over the first dataframe for every code in the second one
for row in icd10_terms.itertuples():
    code, terms = row[1], row[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
  • Your problem is that you are using two nested iteration loops, and that is really slow with pandas. A vectorized solution would be the best option, but it might be hard to find for this case. I would try something with the apply() function. Have a look at this question: https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316 – m33n Apr 26 '18 at 11:02
  • I was also thinking about using apply(), but I don't know how to implement it in this case with 2 different dataframes – Rick Bruins Apr 26 '18 at 11:25
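
Following the apply() idea from the comments, one possible sketch for working across the two dataframes is to precompute the codes and term sets of the second frame once, and have the applied function return a whole Series of intersection counts, so pandas assembles all the columns in a single pass. The column names are taken from the first code block above, and it again assumes both columns hold Python sets:

codes = data2['code'].tolist()
term_sets = data2['terms'].tolist()

def sims(tokens):
    # one result row: intersection size against every set of terms in data2
    return pd.Series({c: len(tokens & t) for c, t in zip(codes, term_sets)})

result = data['text_tokenized'].apply(sims)
result.insert(0, 'code', data['code'].values)

Compared with the update above, this calls apply() once instead of once per code and builds all the columns in one go rather than inserting them into data one at a time.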
