Find similarities between strings within a DataFrame column

Question

I have similar names for clients that I want to group into one, for example:

A header
schwabstsoct2022
schwabsts
schwabregionaloct2022
schwabregional2
flagstar-2022
flagstar-2021

Some names contain a character I can split on to classify them, but others don't. Is there a similarity score between rows that I can use to classify them quickly, with the output in another column?

Thanks!

Try looking at [Find the similarity metric between two strings](https://stackoverflow.com/a/17388505/16653700). — Alias Cartellano, Mar 31 '23 at 16:29

score 2 · Accepted Answer · answered Mar 31 '23 at 22:03

I hope I've understood your question correctly. To compute a similarity score, you can use the built-in difflib module:

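from difflib import SequenceMatcher

import pandas as pd

df = pd.DataFrame({'A header': [
    'schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
    'schwabregional2', 'flagstar-2022', 'flagstar-2021',
]})

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()

# Add one column per name holding its similarity to every row.
for s1 in df['A header']:
    df[s1] = [similar(s1, s2) for s2 in df['A header']]

print(df)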

Prints:

                A header  schwabstsoct2022  schwabsts  schwabregionaloct2022  schwabregional2  flagstar-2022  flagstar-2021
0       schwabstsoct2022          1.000000   0.720000               0.702703         0.516129       0.482759       0.413793
1              schwabsts          0.720000   1.000000               0.466667         0.500000       0.272727       0.272727
2  schwabregionaloct2022          0.702703   0.466667               1.000000         0.833333       0.352941       0.294118
3        schwabregional2          0.516129   0.500000               0.833333         1.000000       0.142857       0.142857
4          flagstar-2022          0.482759   0.272727               0.411765         0.285714       1.000000       0.923077
5          flagstar-2021          0.413793   0.272727               0.352941         0.285714       0.923077       1.000000
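
The score matrix does not by itself put a label in another column, which is what the question asks for. Here is a minimal sketch of one way to get there, assuming a greedy pass in which each name joins the first earlier name it is similar enough to. The assign_groups helper, the 'group' column name, and the 0.8 threshold are illustrative assumptions and would need tuning on real data:

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()


def assign_groups(names, threshold=0.8):
    # Hypothetical helper: greedy grouping by similarity score.
    # Each name is labelled with the first earlier "representative"
    # whose score reaches the threshold; otherwise it starts a new group.
    representatives = []
    labels = []
    for name in names:
        for rep in representatives:
            if similar(rep, name) >= threshold:
                labels.append(rep)
                break
        else:
            representatives.append(name)
            labels.append(name)
    return labels


df = pd.DataFrame({'A header': [
    'schwabstsoct2022', 'schwabsts', 'schwabregionaloct2022',
    'schwabregional2', 'flagstar-2022', 'flagstar-2021',
]})

# Store the group label next to each name.
df['group'] = assign_groups(df['A header'])
print(df)
```

With the six names from the question and a threshold of 0.8, the two flagstar names share a group and the two schwabregional names share a group, while schwabsts and schwabstsoct2022 (score 0.72) stay separate. Lowering the threshold merges more names, and the result depends on row order because the first matching representative wins.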

it's really interesting, similarity score for quick classification instead of using AI — Laurent B., Apr 01 '23 at 19:10
