Create clusters based on string similarity in pandas

Question

I have lists of names, around 200-300k entries.

For example:

Names1	Names2
Mr.Reven	Alex
Freddie	Keven
Miss.Grey	Moeen
James	Shayne
Neoveeen	Frey
Boult	mcKay
Dr.Alen	Adames
Alsray	Miss. Slout

Each value in Names1 should be compared with every value in Names2, and my pandas code should then create clusters such as Cluster-1, Cluster-2, Cluster-3, etc. Each cluster should contain a list of similar names (with honorifics removed, whether prefix or suffix) whose similarity is greater than 90%.

For example:

Cluster-1	Cluster-2	Cluster-3
Frey	Reven	Moeen
Grey	Keven	Neoveeen

Is there any way to do that in pandas?

What is your precise definition of similar names, and how do you want to calculate the similarity value? It looks like you want names that have the same characters in the same order, but how do you define two names as more than 90% similar? — Mr. For Example, Jan 02 '21 at 03:54
@Ukrainian-serge I have tried, but not with this logic; just for the sake of understanding, I tried it with a very small example. — Red Vibes, Jan 02 '21 at 04:11
@Mr. For Example The order does not necessarily have to match. By a 90% match I mean that there are many algorithms, such as Levenshtein distance, Jaro-Winkler, FuzzyWuzzy, etc.; whichever gives the most precise score for this particular logic should be used, and the names with a higher % match should then be collected based on that score. Is my explanation clear now? Please let me know. — Red Vibes, Jan 02 '21 at 04:37
@RedVibes So, I just want to know: did my answer solve this question, or do you think I missed something? — Mr. For Example, Jan 02 '21 at 10:40
@Mr. For Example I haven't checked it with my data yet, but it is really going to help me proceed. Thanks a lot for giving me your precious time. I will surely come back with an update. In the meantime, I just want to ask how I can improve my Python coding skills. I have seen a lot of programming tutorials but still face issues when writing code and creating the logic behind it. — Red Vibes, Jan 02 '21 at 10:44
@RedVibes I can give you a few pieces of advice about improving your coding skills, based on the path I have walked: before you dig deep into Python coding techniques, improve your general problem-solving skills first. That is, learn how to learn before spending tons of time mastering a skill. I believe you can find plenty of free material online (and books like [Are Your Lights On?: How to Figure Out What the Problem Really Is](https://www.amazon.com/Are-Your-Lights-Figure-Problem/dp/0932633161)) — Mr. For Example, Jan 02 '21 at 11:04
@RedVibes Be aware that something even more important than knowing the right way to learn is finding out what you really want to achieve with your time. You don't want to spend half a year learning `Pygame` only to find out that you want to use `Unity` to build a 3D game. PS: no offense to `Pygame` : ) — Mr. For Example, Jan 02 '21 at 11:08
@Pygame Thanks a lot. I didn't even realise that 'learn how to learn before spending tons of time mastering a skill' can be a turning point and the first step towards success. Hats off; I will surely follow your guidelines. — Red Vibes, Jan 02 '21 at 12:05
@Pygame Does this code work in this manner: the first value of column **names1** is compared with each and every value of column **names2**, and the names with a higher score are collected in a particular cluster; then the second value is compared with each and every value of names2, and so on? — Red Vibes, Jan 02 '21 at 14:31

score 1 · Answer 1 · answered Jan 02 '21 at 06:34

Example code based on this similarity metric (`difflib.SequenceMatcher.ratio`):

import pandas as pd
from difflib import SequenceMatcher
import numpy as np
import re

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def remove_prefix(s):
    # Split on '.', space, '_' or '-' and keep the last token (drops a leading honorific)
    return re.split(r'\.| |_|-', s)[-1]

# Mimic dataframe
d = {'Names1': ['Mr.Reven', 'Freddie', 'Miss.Grey', 'James', 'Neoveeen', 'Boult', 'Dr.Alen', 'Alsray'], 
     'Names2': ['Alex', 'Keven', 'Moeen', 'Shayne', 'Frey', 'mcKay', 'Adames', 'Miss. Slout']}
df = pd.DataFrame(d)

# Get both name lists with prefixes removed
remove_prefix_fv = np.vectorize(remove_prefix)
names1 = remove_prefix_fv(df['Names1'].to_numpy())
names2 = remove_prefix_fv(df['Names2'].to_numpy())

# Get similarity scores for each pair between Names1 and Names2
similar_fv = np.vectorize(similar)
scores = similar_fv(names1[:, np.newaxis], names2)

# Keep the pairs at or above the threshold
# (0.7 for this small demo; use 0.9 for the 90% cutoff asked for in the question)
threshold = 0.7
ind = np.where(scores >= threshold)

# Cluster the Names2 elements with same Names1 element
uc = np.unique(ind[0])
cd = {"Cluster-" + str(i): [names1[uc[i]]] + list(names2[ind[1][np.where(ind[0] == uc[i])[0]]]) for i in range(len(uc))}

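# Build the dataframe; wrapping each list in a Series pads clusters of different sizes with NaN
cdf = pd.DataFrame({k: pd.Series(v) for k, v in cd.items()})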
print(cdf)

Outputs:

  Cluster-0 Cluster-1 Cluster-2 Cluster-3
0     Reven      Grey     James      Alen
1     Keven      Frey    Adames      Alex

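The code above builds one cluster per Names1 entry and holds the full len(Names1) x len(Names2) score matrix in memory. As a rough sketch of a variant (not the method used in the answer), the snippet below links every above-threshold pair with a small union-find, so chains of similar names end up in one cluster, and it never materialises the score matrix. It uses only the standard library. `HONORIFICS`, `strip_honorifics`, `similarity` and `cluster_names` are hypothetical names introduced here, and the honorific list is an assumption that would need extending for real data.

```python
import re
from difflib import SequenceMatcher

# Assumed honorific list (hypothetical); extend it for real data.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "prof", "sir"}


def strip_honorifics(name):
    """Split on '.', whitespace, '_' or '-' and drop any honorific tokens."""
    tokens = [t for t in re.split(r"[.\s_-]+", name.strip()) if t]
    return " ".join(t for t in tokens if t.lower() not in HONORIFICS)


def similarity(a, b, threshold):
    """Case-insensitive SequenceMatcher ratio, or 0.0 when a cheap upper bound
    already shows the pair cannot reach the threshold."""
    if not a or not b:
        return 0.0
    sm = SequenceMatcher(None, a.lower(), b.lower())
    # real_quick_ratio() and quick_ratio() are upper bounds on ratio()
    if sm.real_quick_ratio() < threshold or sm.quick_ratio() < threshold:
        return 0.0
    return sm.ratio()


def cluster_names(names1, names2, threshold=0.9):
    """Return sorted clusters of names linked by above-threshold pairs
    (one name from each list per pair)."""
    left = [strip_honorifics(n) for n in names1]
    right = [strip_honorifics(n) for n in names2]
    nodes = left + right
    parent = list(range(len(nodes)))

    def find(i):
        # Find the root of i, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if similarity(a, b, threshold) >= threshold:
                parent[find(i)] = find(len(left) + j)

    groups = {}
    for idx, name in enumerate(nodes):
        groups.setdefault(find(idx), []).append(name)
    # Keep only groups that contain at least one matched pair
    return sorted(sorted(set(g)) for g in groups.values() if len(g) > 1)


names1 = ['Mr.Reven', 'Freddie', 'Miss.Grey', 'James', 'Neoveeen', 'Boult', 'Dr.Alen', 'Alsray']
names2 = ['Alex', 'Keven', 'Moeen', 'Shayne', 'Frey', 'mcKay', 'Adames', 'Miss. Slout']
for i, cluster in enumerate(cluster_names(names1, names2, threshold=0.7), start=1):
    print(f"Cluster-{i}: {cluster}")
```

The nested loop is still quadratic, so for 200-300k names per list it would mean tens of billions of comparisons. In practice you would probably need a blocking step first (for example, only comparing names of similar length or with the same first letter), which is not shown here.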