I have a dataframe that looks like the following, but with more rows. For each document in the first column there are some similar labels in the second column and a list of strings in the last column.
import pandas as pd

data = {
    'First': ['First doc', 'Second doc', 'Third doc',
              'First doc', 'Second doc', 'Third doc',
              'First doc', 'Second doc', 'Third doc'],
    'second': ['First', 'Second', 'Third',
               'second', 'third', 'first',
               'third', 'first', 'second'],
    'third': [['old', 'far', 'gold', 'door'], ['old', 'view', 'bold', 'values'],
              ['new', 'view', 'sure', 'window'], ['old', 'bored', 'gold', 'door'],
              ['valued', 'this', 'bold', 'door'], ['new', 'view', 'seen', 'shirt'],
              ['old', 'bored', 'blouse', 'door'], ['valued', 'this', 'bold', 'open'],
              ['new', 'view', 'seen', 'win']],
}
df = pd.DataFrame(data, columns=['First', 'second', 'third'])
df
I have stumbled upon this piece of code for Jaccard similarity:
def lexical_overlap(doc1, doc2):
    # Treat each document as a set of unique words
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    # Jaccard similarity expressed as a percentage
    return float(len(intersection)) / len(union) * 100
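As a quick sanity check of what this measure returns, here is the same computation worked by hand for the first two rows of the third column. This is only a minimal sketch; the variable names `shared`, `combined` and `score` are illustrative and not part of the original code.

```python
# First two rows of the 'third' column from the dataframe above
doc1 = ['old', 'far', 'gold', 'door']
doc2 = ['old', 'view', 'bold', 'values']

shared = set(doc1) & set(doc2)      # words present in both rows: {'old'}
combined = set(doc1) | set(doc2)    # all distinct words across both rows: 7

# Jaccard similarity as a percentage: |intersection| / |union| * 100
score = len(shared) / len(combined) * 100
print(round(score, 2))  # 14.29
```

So these two rows share one word out of seven distinct words, which gives roughly 14 percent.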
What I would like is for the measure to treat each row of the third column as a document, compare every pair of rows, and output the score together with each row's labels from the First and second columns. So, for all combinations, something like this:
first doc(first) and second doc(first) are 23 percent similar
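To make the desired output concrete, one possible shape for it is sketched below. It assumes that `itertools.combinations` over the rows is an acceptable way to enumerate each unordered pair once, and it uses only the first three rows of the data as a stand-in. The helper name `pairwise_report` is made up for illustration, and the function body is a condensed equivalent of the Jaccard code above.

```python
from itertools import combinations

import pandas as pd


def lexical_overlap(doc1, doc2):
    """Jaccard similarity of two word lists, as a percentage."""
    words_doc1, words_doc2 = set(doc1), set(doc2)
    union = words_doc1 | words_doc2
    return len(words_doc1 & words_doc2) / len(union) * 100


# Small stand-in for the full dataframe (first three rows only)
df = pd.DataFrame({
    'First': ['First doc', 'Second doc', 'Third doc'],
    'second': ['First', 'Second', 'Third'],
    'third': [['old', 'far', 'gold', 'door'],
              ['old', 'view', 'bold', 'values'],
              ['new', 'view', 'sure', 'window']],
})


def pairwise_report(frame):
    """Hypothetical helper: one sentence per unordered pair of rows."""
    lines = []
    # iterrows() yields (index, row); combinations pairs each row with every later row
    for (_, a), (_, b) in combinations(frame.iterrows(), 2):
        score = lexical_overlap(a['third'], b['third'])
        lines.append(
            f"{a['First']}({a['second']}) and {b['First']}({b['second']}) "
            f"are {score:.0f} percent similar"
        )
    return lines


for line in pairwise_report(df):
    print(line)
```

With n rows this produces n * (n - 1) / 2 comparisons, so it may become slow on a much larger dataframe.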
I have already asked a similar question and have tried to modify the answer, but had no luck extending it to multiple columns.