I have the following DataFrame (df) in pandas. This is just an example; the real DataFrame has more than 2,000 rows and more than 20 distinct names.

ID Name
1 Andrea Gonzlez
2 Andrea Glz
3 Andrea Glez
4 Lineth Arce
5 lineth a
6 lineth aerc

I want to compare the name in row 1 with the name in row 2, and if their similarity ratio is above 80, change the name in row 2 to the name in row 1. In the end I want a column that holds a single spelling for each distinct name.

What I did was create a list of names, names = ['Andrea Glz', 'Lineth Arce'], and then write a function:

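# Assumed import: fuzzywuzzy, thefuzz and rapidfuzz all expose fuzz.token_set_ratio
from fuzzywuzzy import fuzz

def compare(x):
    # Return the first name in the list whose similarity to x is above 80
    for i in names:
        ratio = fuzz.token_set_ratio(i, x)
        if ratio > 80:
            return i
    return x  # no match: keep the original value instead of returning None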

Then I use the following code to overwrite the column with the matched result from the names list:

df['Name'] = df['Name'].apply(compare)
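
A possible low-effort variation on the call above, offered only as a sketch: if the same spelling appears on many rows, caching the result per distinct string avoids scoring it again. The names score and compare_cached are hypothetical, and score uses the standard library's difflib as a stand-in so the snippet runs without extra packages; it does not produce the same numbers as fuzz.token_set_ratio, which could be swapped back in.

```python
from difflib import SequenceMatcher
from functools import lru_cache

names = ['Andrea Glz', 'Lineth Arce']  # the list from the question


def score(a, b):
    # Stand-in scorer on a 0-100 scale (assumption: NOT fuzz.token_set_ratio)
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


@lru_cache(maxsize=None)
def compare_cached(x):
    # Same logic as compare, but each distinct string is scored only once
    for i in names:
        if score(i, x) > 80:
            return i
    return x  # no match: keep the original value
```

It would be applied the same way, df['Name'] = df['Name'].apply(compare_cached). This only helps when spellings repeat; if every row is unique, the cache saves nothing.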

I get the desired result, but it takes a lot of processing time. Is there an easier and faster way of doing this?

Desired result table:

ID Name
1 Andrea Gonzlez
2 Andrea Gonzlez
3 Andrea Gonzlez
4 Lineth Arce
5 Lineth Arce
6 Lineth Arce
Init5 God

1 Answer

You can do the following:

1> Find the unique names in the DataFrame

2> Build the unique two-name combinations of those names, for example with itertools.combinations

---Name1-------|----Name2-------|
Andrea Gonzlez | Andrea Gonzlez |
Andrea Gonzlez | Lineth Arce    |
...
---------------|----------------|

3> Compute the similarity of the two columns for each row

---Name1-------|----Name2-------|----similarity---|
Andrea Gonzlez | Andrea Gonzlez |    100          |
Andrea Gonzlez | Lineth Arce    |     20          |
...
---------------|----------------|-----------------|

4> Select the rows where the similarity is above 80; in each of those pairs Name2 is a variant spelling that can be replaced by Name1
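
The four steps can be sketched roughly as follows. This is a minimal sketch under assumptions, not a definitive implementation: it assumes the goal is to map the second name of every pair scoring above the threshold onto the first, the names difflib_ratio and build_name_map are hypothetical, and the default scorer is a standard-library stand-in rather than fuzz.token_set_ratio.

```python
from difflib import SequenceMatcher
from itertools import combinations


def difflib_ratio(a, b):
    # Stand-in scorer on a 0-100 scale (assumption: NOT fuzz.token_set_ratio)
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_name_map(names, scorer=difflib_ratio, threshold=80):
    """Map every distinct name to the earliest similar name seen."""
    # Step 1: unique names, kept in first-seen order
    unique = list(dict.fromkeys(names))
    canonical = {name: name for name in unique}
    # Step 2: every unique pair, where first always precedes second
    for first, second in combinations(unique, 2):
        if canonical[second] != second:
            continue  # second was already merged into an earlier name
        # Step 3: score the pair; step 4: merge it when above the threshold
        if scorer(first, second) > threshold:
            canonical[second] = canonical[first]
    return canonical
```

With the question's scorer this could be used as df['Name'] = df['Name'].map(build_name_map(df['Name'], scorer=fuzz.token_set_ratio)). The scorer then runs once per pair of distinct names rather than once per row and list entry, which may or may not be cheaper depending on how many distinct spellings the column holds.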

s510
  • Thank you, it may work like that, but it's a lot more code than what I already have. What I'm looking for is a way to loop through one column and use the "next" row in the same operation, that is, how to compare the row I'm on with the one before it. – Init5 God Sep 07 '22 at 14:56
  • Looping is mostly a single-threaded process, so the processing time isn't expected to drop drastically compared to the method you are already using. – s510 Sep 07 '22 at 15:01
  • Thank you. I'll keep investigating how to do this as fast as I can. I appreciate your help. – Init5 God Sep 07 '22 at 23:28