Optimizing RapidFuzz for a large number of elements and obtaining the match score

Question

Following this answer, I am also trying to obtain the string match score between two lists of names. What would be the best way of doing that?

import pandas as pd

elements = pd.DataFrame({'name': ['vikash', 'vikas', 'Vinod', 'Vikky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'Ajay', 'Suchin', 'Akash', 'vikahs']})

elements2 = pd.DataFrame({'name': ['Ajay1', 'Suchin', 'Akassh', 'vikahs', 'vikash', 'vikash', 'Vinodh', 'Viky', 'Akash', 'Vinodh', 'Sachin', 'Salman', 'saman', 'Vikky']})

What I have tried so far:

import numpy as np
from rapidfuzz.process import cdist

# cdist expects sequences of strings, so pass the 'name' columns as lists
names = elements['name'].tolist()
names2 = elements2['name'].tolist()

# Score every name in elements against every name in elements2;
# scores below score_cutoff come back as 0
sa = cdist(names, names2, score_cutoff=90, workers=-1)

duplicates_list = []
score_list = []

for distances in sa:
    # Get indices of near-duplicates (skip exact matches and non-matches)
    indices = np.argwhere(~np.isin(distances, [100, 0])).flatten()
    # Get names from indices
    duplicates_list.append([names2[i] for i in indices])
    # Get the scores at the same indices so they line up with the names
    score_list.append(distances[indices].tolist())

# Create dataframe using the data
df = pd.DataFrame({'name': names, 'duplicates': duplicates_list, 'score': score_list})

For each name in elements, I am trying to obtain the matching names from elements2 together with their scores.
