pandas: calculate overlapping words between rows only if values in another column match

Question

I have a dataframe that looks like the following, but with many rows:

import pandas as pd

data = {'intent': ['order_food', 'order_food', 'order_taxi', 'order_call', 'order_call', 'order_taxi'],
        'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab', 'call me at 6', 'she called me', 'i would like a new taxi'],
        'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'], ['call', '6'], ['call'], ['new', 'taxi']]}

df = pd.DataFrame(data, columns=['intent', 'Sent', 'key_words'])

I have calculated the Jaccard similarity using the code below (not my solution):

def lexical_overlap(doc1, doc2):
    # return the elements (e.g. keywords) that appear in both inputs
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)

    return intersection

I then modified the code given by @Amit Amola to compare the overlapping words between every possible pair of rows, and created a dataframe out of it:

from itertools import combinations

overlapping_word_list = []

# compare every pair of row positions in data_new
for val in combinations(range(len(data_new)), 2):
    overlapping_word_list.append(f"the shared keywords between {data_new.iloc[val[0],0]} and {data_new.iloc[val[1],0]} sentences are: {lexical_overlap(data_new.iloc[val[0],1],data_new.iloc[val[1],1])}")
# creating an overlap dataframe
banking_overlapping_words_per_sent = pd.DataFrame(overlapping_word_list, columns=['overlapping_list'])

Since my dataset is huge, running this code to compare all rows takes forever. Instead, I would like to compare only the sentences that have the same intent and skip sentences with different intents. I am not sure how to do that.
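To illustrate why restricting the comparison to matching intents should help, here is a rough count of the pairs involved for the sample data above. The variable names `all_pairs` and `same_intent_pairs` are only illustrative, and `math.comb` assumes Python 3.8 or newer.

```python
from collections import Counter
from math import comb

intents = ['order_food', 'order_food', 'order_taxi',
           'order_call', 'order_call', 'order_taxi']

# every unordered pair of rows, regardless of intent: n * (n - 1) / 2
all_pairs = comb(len(intents), 2)

# only pairs whose intent matches: sum the pair count of each intent group
same_intent_pairs = sum(comb(n, 2) for n in Counter(intents).values())

print(all_pairs, same_intent_pairs)
```

The gap between the two counts grows as the number of rows and distinct intents grows, since the first count is quadratic in the total number of rows.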

score 0 · Accepted Answer · answered Apr 28 '21 at 11:34

IIUC you just need to iterate over the unique values in the intent column and then use loc to grab just the rows that correspond to that. If you have more than two rows you will still need to use combinations to get the unique combinations between similar intents.

from itertools import combinations

for intent in df.intent.unique():
    # select the Sent column for the rows with this intent, as a plain list
    rows = df.loc[df.intent == intent, "Sent"].to_list()
    # every unordered pair of sentences within the same intent
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x, y)
        print(f"Overlap for ({x}) and ({y}) is {overlap}")

#  Overlap for (i need hamburger) and (she wants sushi) is 46.666666666666664
#  Overlap for (i need a cab) and (i would like a new taxi) is 40.0
#  Overlap for (call me at 6) and (she called me) is 54.54545454545454
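As a possible variation on the loop above, the same pairing can be sketched with `groupby`, which avoids filtering the frame once per intent. The helper name `same_intent_pairs` and its parameters are hypothetical, not part of pandas, and the sample frame is the one from the question.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'intent': ['order_food', 'order_food', 'order_taxi',
               'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab',
             'call me at 6', 'she called me', 'i would like a new taxi'],
})


def same_intent_pairs(frame, group_col='intent', text_col='Sent'):
    """Yield (intent, text_a, text_b) for every pair of rows sharing an intent."""
    # sort=False keeps the groups in order of first appearance
    for intent, group in frame.groupby(group_col, sort=False):
        for x, y in combinations(group[text_col].tolist(), 2):
            yield intent, x, y


pairs = list(same_intent_pairs(df))
for intent, x, y in pairs:
    print(f"{intent}: ({x}) and ({y})")
```

Each yielded pair can then be passed to `lexical_overlap` or any other comparison function.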

gold_cy

Thank you very much for your reply. Could you please tell me how I can get output like this (e.g. "the overlap of intent order_call for (call me at 6) and (she called me) is {'call'}"), based on the key_words column, if I change the lexical_overlap function to output only the intersection? Thank you very much – zara kolagar Apr 28 '21 at 12:24
Sorry, I'm not following your question. Your lexical intersection function only outputs the intersection, nothing else. As for what you want to print, that's up to you. – gold_cy Apr 28 '21 at 12:34
Sorry if I was not clear in my question. I would like output like the following from your function, for example: the overlap of intent (order_call) for (call me at 6) and (she called me) is {'call'}, and of course for the rest it is an empty set. I figured I could make this change in your code: df.loc[df.intent == intent, ['intent','key_words','Sent']].values.tolist(), but I do not know how to proceed to get the output I mentioned above – zara kolagar Apr 28 '21 at 12:43
The only issue I have is that it does not work in a situation where there are more instances of an intent – zara kolagar Apr 28 '21 at 13:23

score 0 · Answer 2 · answered Apr 28 '21 at 12:55

OK, so I figured out how to get the desired output mentioned in the comments, based on @gold_cy's answer:

for intent in df.intent.unique():
    # each row becomes a list: [intent, key_words, Sent]
    rows = df.loc[df.intent == intent, ['intent', 'key_words', 'Sent']].values.tolist()
    # every unordered pair of rows within the same intent
    for x, y in combinations(rows, 2):
        overlap = lexical_overlap(x[1], y[1])
        print(f"Overlap of intent ({x[0]}) for ({x[2]}) and ({y[2]}) is {overlap}")
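Since the question also builds a dataframe from the results, here is a minimal sketch of how the per-intent keyword overlaps might be collected into one, for any number of rows per intent. The names `records` and `overlap_df` and the column names `sent_1`, `sent_2` and `overlap` are made up for illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'intent': ['order_food', 'order_food', 'order_taxi',
               'order_call', 'order_call', 'order_taxi'],
    'Sent': ['i need hamburger', 'she wants sushi', 'i need a cab',
             'call me at 6', 'she called me', 'i would like a new taxi'],
    'key_words': [['need', 'hamburger'], ['want', 'sushi'], ['need', 'cab'],
                  ['call', '6'], ['call'], ['new', 'taxi']],
})

records = []
for intent, group in df.groupby('intent', sort=False):
    # pair each sentence with its keyword list
    rows = list(zip(group['Sent'], group['key_words']))
    # compare every unordered pair of rows that share this intent
    for (sent_x, kw_x), (sent_y, kw_y) in combinations(rows, 2):
        records.append({
            'intent': intent,
            'sent_1': sent_x,
            'sent_2': sent_y,
            'overlap': set(kw_x) & set(kw_y),  # shared keywords only
        })

overlap_df = pd.DataFrame(records, columns=['intent', 'sent_1', 'sent_2', 'overlap'])
print(overlap_df)
```

For the sample data this gives one row per intent, and only the order_call pair shares a keyword.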
