I have a dataframe as follows, only with more rows:

import pandas as pd

data = {'First': ['First value', 'Second value', 'Third value'],
        'Second': [['old', 'new', 'gold', 'door'],
                   ['old', 'view', 'bold', 'door'],
                   ['new', 'view', 'world', 'window']]}

df = pd.DataFrame(data, columns=['First', 'Second'])

To calculate the Jaccard similarity, I found this piece of code online (not my solution):

def lexical_overlap(doc1, doc2): 
    words_doc1 = set(doc1) 
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    
    return float(len(intersection)) / len(union) * 100
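
As a quick sanity check of what the function returns, here is a minimal sketch that applies it to the first two lists from the example data. The names `row_first`, `row_second`, and `score` are only illustrative.

```python
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100


# The first two lists of the 'Second' column (illustrative variable names).
row_first = ['old', 'new', 'gold', 'door']
row_second = ['old', 'view', 'bold', 'door']

# Shared words: {'old', 'door'} -> 2; distinct words across both lists: 6.
score = lexical_overlap(row_first, row_second)
print(round(score, 2))  # 33.33
```

The result is a percentage, so identical lists score 100 and lists with no words in common score 0.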

What I would like is for the measure to treat each row of the Second column as a document, compare every pair of rows iteratively, and output the result labelled with the row names from the First column, something like this:

First value and Second value = 80 

First value and Third value  = 95

Second value and Third value = 90
ALollz
zara kolagar
  • How large is your data? One should be very careful when working with pair-wise measurements on large data. – Quang Hoang Dec 15 '20 at 15:28
  • "compare each pair iteratively"? Which pairs, please? – Aaj Kaal Dec 15 '20 at 15:30
  • Those output values don't seem correct given your data? – ALollz Dec 15 '20 at 15:30
  • Which part are you having trouble with? *Allocating/getting* the distinct pairs? – wwii Dec 15 '20 at 15:33
  • Hi all, there are around 30 rows, and by pairs I meant, for example, (first value & second value, first value & third value, etc.). Yes, I just put the output as an example of how I wanted it to look. The problem is that I want the function to take every doc1 and doc2 from the dataframe iteratively – zara kolagar Dec 15 '20 at 15:41
  • Does [All possible combinations of pandas data frame rows](https://stackoverflow.com/questions/51746635/all-possible-combinations-of-pandas-data-frame-rows) answer your question? – wwii Dec 15 '20 at 15:45
  • Yes, but I cannot figure out how to use it with the Jaccard function so that it outputs the result for each pair – zara kolagar Dec 15 '20 at 15:54

2 Answers

Since your data is not big, you can try broadcasting with a slightly different approach:

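# indicator (dummy) matrix: one row per document, one column per word
# groupby(level=0).sum() replaces sum(level=0), which pandas 2.0 removed
s = pd.get_dummies(df['Second'].explode()).groupby(level=0).sum().to_numpy()

# pair-wise Jaccard: intersection counts / union counts, as percentages
(s @ s.T) / (s | s[:, None, :]).sum(-1) * 100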

Output:

array([[100.        ,  33.33333333,  14.28571429],
       [ 33.33333333, 100.        ,  14.28571429],
       [ 14.28571429,  14.28571429, 100.        ]])
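
To turn this matrix into the labelled pairs the question asks for, one option is to wrap it in a DataFrame indexed by the `First` column and read off the entries above the diagonal. This is only a sketch under a few assumptions: the dataframe has the default `RangeIndex`, the counts are clipped to 0/1 so a word repeated within one list is counted once, and the names `indicator`, `similarity`, and `labels` are illustrative.

```python
import numpy as np
import pandas as pd

data = {'First': ['First value', 'Second value', 'Third value'],
        'Second': [['old', 'new', 'gold', 'door'],
                   ['old', 'view', 'bold', 'door'],
                   ['new', 'view', 'world', 'window']]}
df = pd.DataFrame(data, columns=['First', 'Second'])

# One row per document, one column per word (assumes a default RangeIndex).
dummies = pd.get_dummies(df['Second'].explode()).groupby(level=0).sum()
# Clip to 0/1 so a word repeated inside one list is counted only once.
indicator = dummies.clip(upper=1).to_numpy().astype(int)

# Pairwise intersection and union sizes via broadcasting.
intersection = indicator @ indicator.T
union = (indicator[:, None, :] | indicator[None, :, :]).sum(axis=-1)

labels = df['First'].tolist()
similarity = pd.DataFrame(intersection / union * 100,
                          index=labels, columns=labels)

# Keep each unordered pair once: positions strictly above the diagonal.
rows, cols = np.triu_indices(len(labels), k=1)
for i, j in zip(rows.tolist(), cols.tolist()):
    print(f"{labels[i]} and {labels[j]} = {similarity.iloc[i, j]:.2f}")
```

The broadcasted union builds an array of shape (rows, rows, words), so memory grows quadratically with the number of rows; for about 30 rows that is negligible.
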
Quang Hoang
  • Nice. Off the top of your head, where in the docs are the @ and | operators mentioned? Or do they just work the same because they are Python operators? – wwii Dec 15 '20 at 16:08
  • I always thought the level parameter was for a MultiIndex, I didn't know it could work like that on row indices with *duplicate* values. – wwii Dec 15 '20 at 16:22
  • @wwii they (`@` and `|`) are operators on numpy arrays, which can work with pandas as well. `s` is actually a 2d numpy array. – Quang Hoang Dec 15 '20 at 16:28

Well, I'd do it somewhat like this:

from itertools import combinations

# iterate over every unordered pair of row positions
for i, j in combinations(range(len(df)), 2):
    firstlist = df.iloc[i, 1]    # word list of the first row in the pair
    secondlist = df.iloc[j, 1]   # word list of the second row in the pair

    value = round(lexical_overlap(firstlist, secondlist), 2)

    print(f"{df.iloc[i, 0]} and {df.iloc[j, 0]}'s value is: {value}")

Output:

First value and Second value's value is: 33.33
First value and Third value's value is: 14.29
Second value and Third value's value is: 14.29
Amit Amola
  • `for combo in itertools.combinations(df.Second,2): js = lexical_overlap(*); print(js)`. – wwii Dec 15 '20 at 16:09
  • Could you elaborate a bit more on this? – Amit Amola Dec 15 '20 at 16:11
  • It is just a simplification of your loop without using indices to retrieve the combinations - `combo` in my comment will have the two lists to compare. However in my zeal, I *mis-wrote* it - it should be `for combo in itertools.combinations(df.Second,2): js = lexical_overlap(*combo); print(js)` – wwii Dec 15 '20 at 16:25
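
Iterating over `df.Second` alone, as in the comment above, drops the row labels. A minimal sketch that keeps them is to zip the two columns before taking combinations; the `results` list and the `first`, `second`, and `jaccard_pct` column names are illustrative, not part of either answer.

```python
from itertools import combinations

import pandas as pd


def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    return float(len(intersection)) / len(union) * 100


data = {'First': ['First value', 'Second value', 'Third value'],
        'Second': [['old', 'new', 'gold', 'door'],
                   ['old', 'view', 'bold', 'door'],
                   ['new', 'view', 'world', 'window']]}
df = pd.DataFrame(data, columns=['First', 'Second'])

# Pair each label with its word list, then take every unordered pair of rows.
results = []
for (name1, doc1), (name2, doc2) in combinations(zip(df['First'], df['Second']), 2):
    results.append((name1, name2, round(lexical_overlap(doc1, doc2), 2)))

# Collect the scores in a dataframe (hypothetical column names).
pairs = pd.DataFrame(results, columns=['first', 'second', 'jaccard_pct'])
print(pairs)
```

Collecting the scores in a dataframe instead of printing them makes it easy to sort or filter the pairs afterwards.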