Pandas - change next row on single column based on the fuzzy wuzzy result of comparing row[i] with row[i+1]

Question

I have the following DataFrame (df) in pandas. This is just an example; the real DataFrame has more than 2,000 rows and more than 20 names.

ID	Name
1	Andrea Gonzlez
2	Andrea Glz
3	Andrea Glez
4	Lineth Arce
5	lineth a
6	lineth aerc

I want to compare the name in row 1 with the name in row 2 and, if their similarity ratio is above 80, replace the name in row 2 with the name in row 1. In the end, the column should contain only one spelling of each name.

What I did was create a list of names, names = ['Andrea Glz', 'Lineth Arce'], and then write a function:

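from fuzzywuzzy import fuzz

def compare(x):
    # Return the first entry of `names` that scores above 80 against x.
    # If nothing matches, the function falls through and returns None.
    for i in names:
        ratio = fuzz.token_set_ratio(i, x)
        if ratio > 80:
            return i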

Then I use the following code to overwrite the column with the matched result from the names list:

df['Name'] = df['Name'].apply(compare)

I get the desired result, but it takes a lot of processing time. Is there an easier and faster way of doing this?

Desired result table:

ID	Name
1	Andrea Gonzlez
2	Andrea Gonzlez
3	Andrea Gonzlez
4	Lineth Arce
5	Lineth Arce
6	Lineth Arce

score 0 · Answer 1 · answered Sep 06 '22 at 19:54

You can do the following:

1> Find the unique names in the DataFrame.

2> Build every unique 2-combination of those names, for example with itertools.combinations:

---Name1-------|----Name2-------|
Andrea Gonzlez | Andrea Gonzlez |
Andrea Gonzlez | Lineth Arce    |
...
---------------|----------------|

3> Compute the similarity between the two columns for each pair:

---Name1-------|----Name2-------|----similarity---|
Andrea Gonzlez | Andrea Gonzlez |    100          |
Andrea Gonzlez | Lineth Arce    |     20          |
...
---------------|----------------|-----------------|

4> Select the rows where the similarity is below 80 and, from those, keep only Name1.
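One possible reading of these steps, as a rough sketch rather than a definitive implementation: score each pair of distinct names once and build a dictionary that points every spelling at the first-seen spelling it matches. Note that this maps the pairs scoring above the threshold instead of filtering on the ones below it. The helper names `simple_ratio` and `build_name_map` are made up for illustration, and `simple_ratio` is a standard-library stand-in based on difflib, not `fuzz.token_set_ratio`, so its scores will differ. Passing `scorer=fuzz.token_set_ratio` should slot in if fuzzywuzzy is installed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def simple_ratio(a, b):
    """Stand-in 0-100 scorer built on difflib (NOT fuzz.token_set_ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_name_map(names, scorer=simple_ratio, threshold=80):
    """Map each distinct name to the first-seen name it matches."""
    # Step 1: unique names, keeping first-seen order.
    unique = list(dict.fromkeys(names))
    mapping = {name: name for name in unique}
    # Step 2: every unique pair, earlier name first.
    for first, second in combinations(unique, 2):
        # Skip pairs where either name was already merged into another one.
        if mapping[first] != first or mapping[second] != second:
            continue
        # Step 3: score the pair; merge the later name into the earlier one.
        if scorer(first, second) > threshold:
            mapping[second] = first
    return mapping


sample = ["Andrea Gonzlez", "Andrea Glz", "Andrea Glez",
          "Lineth Arce", "lineth a", "lineth aerc"]
mapping = build_name_map(sample)
```

With pandas, the dictionary could then be applied in one pass with df['Name'] = df['Name'].map(mapping). The scorer runs at most once per pair of distinct spellings instead of once per row, which is where any saving would come from when the column has many repeated values.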

answered Sep 06 '22 at 19:54

s510


Thank you. It may work like that, but it's a lot more code than what I already have. What I'm looking for is how to loop through one column and use the "next" row in the same operation, that is, how to compare the row I'm on with the one before it. – Init5 God Sep 07 '22 at 14:56
Looping is mostly a single-threaded process, so the processing time isn't expected to drop drastically compared to the method you are already using. – s510 Sep 07 '22 at 15:01
Thank you. I'll keep investigating the way of doing this as fast as I can. I appreciate your help – Init5 God Sep 07 '22 at 23:28
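For the row-against-previous-row comparison asked about in the comments, a minimal sketch could walk the column once and carry the last kept name forward. It assumes similar names sit next to each other (as in the example table), and it is still a plain Python loop, so it is not expected to be dramatically faster. `carry_forward` and `simple_ratio` are hypothetical helper names; `simple_ratio` is a difflib stand-in, not `fuzz.token_set_ratio`, which could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in 0-100 scorer built on difflib (NOT fuzz.token_set_ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def carry_forward(names, scorer=simple_ratio, threshold=80):
    """Replace each name with the last kept name when the two are similar."""
    result = []
    current = None  # the name most recently kept unchanged
    for name in names:
        if current is not None and scorer(current, name) > threshold:
            # Similar enough to the kept name: overwrite this row with it.
            result.append(current)
        else:
            # A new, different name starts here: keep it as the reference.
            current = name
            result.append(name)
    return result
```

Assuming the DataFrame from the question, this could be used as df['Name'] = carry_forward(df['Name'].tolist()). Comparing against the kept name rather than the raw previous row is equivalent here, because the previous row has already been overwritten with that name whenever it matched.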
