
I have a dataframe that looks like the following, but with more rows. For each document in the first column there are some similar labels in the second column and a list of strings in the last column.

import pandas as pd

data = {'First': ['First doc', 'Second doc', 'Third doc', 'First doc', 'Second doc', 'Third doc',
                  'First doc', 'Second doc', 'Third doc'],
        'second': ['First', 'Second', 'Third', 'second', 'third', 'first',
                   'third', 'first', 'second'],
        'third': [['old', 'far', 'gold', 'door'], ['old', 'view', 'bold', 'values'],
                  ['new', 'view', 'sure', 'window'], ['old', 'bored', 'gold', 'door'],
                  ['valued', 'this', 'bold', 'door'], ['new', 'view', 'seen', 'shirt'],
                  ['old', 'bored', 'blouse', 'door'], ['valued', 'this', 'bold', 'open'],
                  ['new', 'view', 'seen', 'win']]}

df = pd.DataFrame(data, columns=['First', 'second', 'third'])
df

I have stumbled upon this piece of code for Jaccard similarity:

def lexical_overlap(doc1, doc2):
    # Jaccard similarity between two word lists, as a percentage
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100
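
Running it on the word lists from the first two rows of 'third' should give about 14.29, since they only share 'old' out of seven unique words:

lexical_overlap(df['third'].iloc[0], df['third'].iloc[1])
# only 'old' is shared out of 7 unique words -> about 14.29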

What I would like is for the measure to take each row of the third column as a doc, compare each pair of rows, and output the measure together with the row names from the First and second columns, so something like this for all combinations:

   first doc(first) and second doc(first) are 23 percent similar

I have already asked a similar question and tried to modify the answer, but did not have any luck adding multiple columns.

2 Answers


This is not very elegant, but hopefully it gets the job done. I converted the column 'third' into a list. For each item in this list, I created a new dataframe new_df, which is a copy of the original dataframe df. I added a column 'compared with' to new_df to note which 'First' value the rows were being compared with. Then I used a lambda function over df to calculate the lexical overlap between the two word lists.

third_list = df['third'].tolist()
for i in range(len(third_list)):
    new_df = df.copy()
    # note which document row i is being compared with
    new_df["compared with"] = df['First'].iloc[i]
    # Jaccard similarity between each row's word list and row i's word list
    new_df["sim"] = df.apply(lambda x: lexical_overlap(x['third'], df['third'].iloc[i]), axis=1)
    print("\n\n")
    print(new_df[['First', 'compared with', 'sim']])

This produces the output below. A document compared with itself gets the highest similarity.


        First compared with         sim
0   First doc     First doc  100.000000
1  Second doc     First doc   14.285714
2   Third doc     First doc    0.000000
3   First doc     First doc   60.000000
4  Second doc     First doc   14.285714
5   Third doc     First doc    0.000000
6   First doc     First doc   33.333333
7  Second doc     First doc    0.000000
8   Third doc     First doc    0.000000



        First compared with         sim
0   First doc    Second doc   14.285714
1  Second doc    Second doc  100.000000
2   Third doc    Second doc   14.285714
3   First doc    Second doc   14.285714
4  Second doc    Second doc   14.285714
5   Third doc    Second doc   14.285714
6   First doc    Second doc   14.285714
7  Second doc    Second doc   14.285714
8   Third doc    Second doc   14.285714



If you wish, you can replace the last print statement with the following:

print(new_df.apply(lambda x: " ".join([x['First'], 'and', x['compared with'], 'are', "{:.2f}".format(x['sim']), 'percent similar']), axis=1))

This creates the output:

0    First doc and First doc are 100.00 percent sim...
1    Second doc and First doc are 14.29 percent sim...
2     Third doc and First doc are 0.00 percent similar
3    First doc and First doc are 60.00 percent similar
4    Second doc and First doc are 14.29 percent sim...
5     Third doc and First doc are 0.00 percent similar
6    First doc and First doc are 33.33 percent similar
7    Second doc and First doc are 0.00 percent similar
8     Third doc and First doc are 0.00 percent similar
dtype: object
  • Hi, thank you for the reply, but I could already get this output with the similar answer I linked to in my question. What I need is to have all combinations of the first and second columns, e.g. first doc (second) and first doc (third) are 20 percent similar – zara kolagar Dec 23 '20 at 08:05

OK, I figured out how to do that with help from this response by Amit Amola, so what I did was refine the code to get all combinations:

from itertools import combinations

# compare every pair of rows exactly once
for i, j in combinations(range(len(df)), 2):
    firstlist = df.iloc[i, 2]
    secondlist = df.iloc[j, 2]

    value = round(lexical_overlap(firstlist, secondlist), 2)

    print(f"{df.iloc[i, 0]} {df.iloc[i, 1]} and {df.iloc[j, 0]} {df.iloc[j, 1]}'s value is: {value}")

This will return values from both the First and second columns.

sample output:
First doc first and second doc first's value is 26.
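
If you would rather collect the results instead of just printing them, here is a rough sketch along the same lines (it reuses df and lexical_overlap from above; the doc1/doc2/sim column names are just placeholders) that builds a small results dataframe and prints the sentence format asked for in the question:

from itertools import combinations

rows = []
for i, j in combinations(range(len(df)), 2):
    # Jaccard similarity between the word lists of row i and row j
    sim = round(lexical_overlap(df.iloc[i, 2], df.iloc[j, 2]), 2)
    rows.append({'doc1': f"{df.iloc[i, 0]} ({df.iloc[i, 1]})",
                 'doc2': f"{df.iloc[j, 0]} ({df.iloc[j, 1]})",
                 'sim': sim})

result = pd.DataFrame(rows)

# e.g. "First doc (First) and Second doc (Second) are 14.29 percent similar"
for r in rows:
    print(f"{r['doc1']} and {r['doc2']} are {r['sim']} percent similar")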