I have similar names for clients that I want to group into one, for example:

A header
schwabstsoct2022
schwabsts
schwabregionaloct2022
schwabregional2
flagstar-2022
flagstar-2021

Some names contain a character I can split on to classify them, but others don't. Is there a similarity score between rows that I can use to classify them quickly, with the result written to another column?

Thanks!

1 Answer

I hope I've understood your question correctly. To compute a similarity score, you can use the built-in difflib module:

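from difflib import SequenceMatcher
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A header': [
    'schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
    'schwabregional2', 'flagstar-2022', 'flagstar-2021']})

def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means identical strings
    return SequenceMatcher(None, a, b).ratio()

# Add one column per name holding its similarity to every row
for s1 in df['A header']:
    df[s1] = [similar(s1, s2) for s2 in df['A header']]

print(df)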

Prints:

                A header  schwabstsoct2022  schwabsts  schwabregionaloct2022  schwabregional2  flagstar-2022  flagstar-2021
0       schwabstsoct2022          1.000000   0.720000               0.702703         0.516129       0.482759       0.413793
1              schwabsts          0.720000   1.000000               0.466667         0.500000       0.272727       0.272727
2  schwabregionaloct2022          0.702703   0.466667               1.000000         0.833333       0.352941       0.294118
3        schwabregional2          0.516129   0.500000               0.833333         1.000000       0.142857       0.142857
4          flagstar-2022          0.482759   0.272727               0.411765         0.285714       1.000000       0.923077
5          flagstar-2021          0.413793   0.272727               0.352941         0.285714       0.923077       1.000000
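
To go from these scores to a single group label per row, one option is a greedy pass with a cutoff. The following is only a sketch, not a definitive method: the helper name assign_groups and the 0.71 threshold are my own assumptions. The cutoff was picked by eye from the table above (it sits between 0.702703 and 0.720000), so going by those scores the schwabsts, schwabregional and flagstar names should end up in three separate groups. It will need tuning for other data.

```python
from difflib import SequenceMatcher


def assign_groups(names, threshold=0.71):
    """Greedy grouping (hypothetical helper): put each name in the first
    group whose representative (its first member) is similar enough."""
    representatives = []  # first name seen in each group
    labels = []           # group id for each input name, in order
    for name in names:
        for group_id, rep in enumerate(representatives):
            if SequenceMatcher(None, rep, name).ratio() >= threshold:
                labels.append(group_id)
                break
        else:
            # No existing group is close enough: start a new one
            representatives.append(name)
            labels.append(len(representatives) - 1)
    return labels


names = [
    "schwabstsoct2022", "schwabsts", "schwabregionaloct2022",
    "schwabregional2", "flagstar-2022", "flagstar-2021",
]
labels = assign_groups(names)
print(labels)
```

With a DataFrame, df['group'] = assign_groups(df['A header']) should store the labels in a new column. Because each name is compared only with the first member of each group, the result depends on row order; stripping dates or other suffixes before comparing may make the grouping more robust.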
Andrej Kesely