similarity detection in same column for all rows in pandas

Question

I have a pandas data frame like this:

  Product
0 Checking savings account
1 Closing account
2 Debt collection
3 Credit reporting credit repair services personal consumer reports
4 Checking savings account

I want to check similarities between rows. Firstly, index number 0 will be compared with all other 4 rows. After it finishes, index number 1 will be compared with all other 4 rows including index number 0. And I have my own comparing/similarity check rule: firstly similar text will be check and counted, and then longer sentence will be counted and similarText will be divided to longerSentence.

For example:

Checking savings account =? Closing account --> it will be 33.3% match. Account is matched it is 1, and longer sentence is first one and it's 3. 1/3=33.3%

Checking savings account =? Debt collection --> it will be 0% match.

Here you can find a similarity check example:

I tried with this code but I can't imagine how to continue. Also I need to delete the "compareItem" during comparing operation. Because if I compare with itself, it will be 100% match always.

Code

for i in df['Product']:
    compareItem = i.split()
    print(compareItem)
    for k in df['Product']:
        compareList = k.split()
        print(compareList)
    print('------')

Output

['Checking', 'savings', 'account'] --compare item
['Checking', 'savings', 'account'] 
['Checking', 'savings', 'account']
['Debt', 'collection']
['Credit', 'reporting', 'credit', 'repair', 'services', 'personal', 'consumer', 'reports']
['Checking', 'savings', 'account']
------
.
.

Edit: I'm not checking duplication. So duplication answers won't be helpful for me.

Difference with other answer is I have different similarity check rule. I'm dividing "similar words" to "longer sentence". It's like: similarWords/longerSentence.

Does this answer your question? [Return Similarity Matrix From Two Variable-length Arrays of Strings (scipy option?)](https://stackoverflow.com/questions/50648860/return-similarity-matrix-from-two-variable-length-arrays-of-strings-scipy-optio) — mullinscr, Nov 12 '22 at 23:13
Hi, with the dataframe you provided, what should the expected result look like? — Laurent, Nov 23 '22 at 18:31

similarity detection in same column for all rows in pandas

0 Answers0