I have a pandas data frame like this:
Product
0 Checking savings account
1 Closing account
2 Debt collection
3 Credit reporting credit repair services personal consumer reports
4 Checking savings account
I want to check the similarity between rows. First, index 0 is compared with the other 4 rows. When that finishes, index 1 is compared with the other 4 rows, including index 0, and so on. I have my own similarity rule: first the words shared by the two texts are counted, then the words in the longer sentence are counted, and the shared count is divided by the longer length (similarText / longerSentence).
For example:
Checking savings account =? Closing account --> 33.3% match. Only "account" matches, so the count is 1; the longer sentence is the first one, with 3 words. 1/3 = 33.3%
Checking savings account =? Debt collection --> 0% match. No words are shared.
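To make the rule concrete, here is a minimal sketch of how it could be written as a function. The name `similarity` is hypothetical, and two details are assumptions the rule above does not settle: words are matched exactly (case-sensitive), and each distinct shared word is counted once.

```python
def similarity(a: str, b: str) -> float:
    """Shared words divided by the word count of the longer sentence."""
    words_a = a.split()
    words_b = b.split()
    longer = max(len(words_a), len(words_b))
    if longer == 0:
        # Both texts are empty, so there is nothing to compare.
        return 0.0
    # Assumption: case-sensitive match, each distinct shared word counts once.
    shared = set(words_a) & set(words_b)
    return len(shared) / longer


print(round(similarity("Checking savings account", "Closing account") * 100, 1))  # 33.3
print(similarity("Checking savings account", "Debt collection"))  # 0.0
```

If "Credit" and "credit" should count as the same word, lowercasing both texts before splitting would be one way to handle it.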
Here you can find a similarity check example:
I tried the code below, but I can't figure out how to continue. I also need to exclude the "compareItem" row itself during the comparison, because comparing a row with itself always gives a 100% match.
Code
for i in df['Product']:
    compareItem = i.split()
    print(compareItem)
    for k in df['Product']:
        compareList = k.split()
        print(compareList)
    print('------')
Output
['Checking', 'savings', 'account'] --compare item
['Checking', 'savings', 'account']
['Closing', 'account']
['Debt', 'collection']
['Credit', 'reporting', 'credit', 'repair', 'services', 'personal', 'consumer', 'reports']
['Checking', 'savings', 'account']
------
.
.
Edit: I'm not looking for duplicates, so answers about duplicate detection won't help me.
The difference from the other answer is that I use a different similarity rule: I divide the number of similar words by the length of the longer sentence, i.e. similarWords/longerSentence.
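For illustration, this is roughly the result I am after, sketched under some assumptions: the helper `similarity` and the column names `row`, `compared_with` and `score` are made up for this example, words are matched case-sensitively, and the row being compared is skipped by its index rather than by its text.

```python
import pandas as pd

df = pd.DataFrame({"Product": [
    "Checking savings account",
    "Closing account",
    "Debt collection",
    "Credit reporting credit repair services personal consumer reports",
    "Checking savings account",
]})


def similarity(a: str, b: str) -> float:
    # Hypothetical helper: distinct shared words / word count of the longer text.
    words_a, words_b = a.split(), b.split()
    longer = max(len(words_a), len(words_b))
    return len(set(words_a) & set(words_b)) / longer if longer else 0.0


results = []
for i, item in df["Product"].items():
    for k, other in df["Product"].items():
        if i == k:
            # Skip the row itself; it would always be a 100% match.
            continue
        results.append((i, k, similarity(item, other)))

scores = pd.DataFrame(results, columns=["row", "compared_with", "score"])
print(scores)
```

Skipping by index means rows 0 and 4 are still compared with each other even though their text is identical, so that pair scores 1.0.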