2

I'm currently cleaning a dataset that shows names and gifts received by each person.

Each row goes like this:

Name Gift
Agustin Dellagiovanna Chocolate
Agustín Delalgiovanna Furniture
Agustín Dellagiovanna Art

As you can see in this example, these three rows represent the same person. But two of them have different typos. The same thing happens with a lot of names in the dataset.

I wanted to know if there is a way for me to find these variations of the same name and replace them with the correct spelling of the name.

For now my only idea is to find each variation after checking the list of unique values in that column, but this's very time consuming given that the dataset has 45954 rows.

Any ideas?

Thanks in advance!

  • 4
    You can use levenshtein distance to find close words. You may also consider Soundex (phonetic translation : Catthy = Katy) to compute such a distance. All of this is part of the "fuzzy matching" problem – Grall Apr 08 '22 at 19:58
  • 1
    Another issue will be to "find the right spelling" : among your close enough words (once you have computed a distance), which one should you keep ? – Grall Apr 08 '22 at 20:04
  • In this case, it's easy. Because each name corresponds to a politician so I can google the right spelling. – tarantinesque t Apr 08 '22 at 20:06

2 Answers2

2

One way to find the distance between strings is using difflib.SequenceMatcher, like in this snippet:

from difflib import SequenceMatcher

unique_names = df['Name'].unique()

threshold = 0.75
candidate_similar_names = []
for i, name1 in enumerate(unique_names[:-1]):
    for name2 in unique_names[i+1:]:
        similarity = SequenceMatcher(None, name1, name2).ratio()
        if similarity > threshold:
            candidate_similar_names.append((name1, name2, similarity))

candidate_similar_names

The lower the threshold parameter, the more distant the names will be.

lemon
  • 14,875
  • 6
  • 18
  • 38
  • 1
    Thank you, I'll start with that then! – tarantinesque t Apr 08 '22 at 20:56
  • I was wondering if I already have the names that I should change and I also have the correct names, can I iterate over the column using replace to correct all the names that are misspelled? – tarantinesque t Apr 27 '22 at 20:44
  • 1
    @tarantinesquet You could do this way but still you should be cautious. If the match is one, you know it's the one you want to substitute, if it is more than one, then maybe you're replacing something you shouldn't replace. It depends on the nature of your data. What you could do is to handle separately the list of matches longer/smaller than 1 (more matches, no match). – lemon Apr 28 '22 at 10:03
  • Yes, I understand. Thank you for your help! – tarantinesque t Apr 28 '22 at 12:36
1

You can use the package fuzzywuzzy, as shown here by Alperen Cetin:

import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'Name': ['Agustin Dellagiovanna', 'Agustín Delalgiovanna', 'Agustín Dellagiovanna'],
                   'Gift': ['Chocolate', 'Furniture', 'Art']})

def func(input_list):
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[i] = input_list[j]
    
    return input_list

wanted = ['Agustin Dellagiovan']

df['Name'] = func(df['Name'].to_list() + wanted)[: -len(wanted)]

This will rename all the similar items to the same. If you create your list of wanted names (all the names that are correct. I don't know how you are going to decide this), then all names should change to those (given they are similar enough).

Rawson
  • 2,637
  • 1
  • 5
  • 14
  • That's great! I can create the list easily given that the people on this df are known politicians. I think this will work perfectly. Thank you! – tarantinesque t Apr 11 '22 at 13:56