Replace similar strings in a column with the same string

Question

I have a Pandas dataframe that collects the names of vendors at which a transaction was made. As this data is automatically collected from bank statements, lots of the vendors are similar... but not quite the same. In summary, I want to replace the different permutations of the vendors' names with a single name.

I think I can work out a way to do it (see below), but I'm a beginner and this seems to me like it's a complex problem. I'd be really interested to see how more experienced coders would approach it.

I have a dataframe like this (in real life, it's about 20 columns and a maximum of around 50 rows):

     Groceries            Car                Luxuries
0    Sainsburys           Texaco wst453      Amazon
1    Sainsburys bur       Texaco east        Firebox Ltd
2    Sainsbury's east     Shell wstl         Sony
3    Tesco                Shell p/stn        Sony ent nrk
4    Tescos ref 657       Texac              Amazon EU
5    Tesco 45783          Moto               Amazon marketplace

I'd like to find the similar entries and replace them with the first instance of those entries, so I'd end up with this:

     Groceries            Car                Luxuries
0    Sainsburys           Texaco wst453      Amazon
1    Sainsburys           Texaco wst453      Firebox Ltd
2    Sainsburys           Shell wstl         Sony
3    Tesco                Shell wstl         Sony
4    Tesco                Texaco wst453      Amazon
5    Tesco                Moto               Amazon

My solution might be far from optimum. I was thinking of sorting alphabetically, then going through bitwise and using something like SequenceMatcher from difflib to compare each pair of vendors. If the similarity is above a certain percentage (I'm expecting to play with this value until I'm happy) then the two vendors will be assumed to be the same. I'm concerned that I might be using a sledgehammer to crack a nut, or it might take a long time (I'm not obsessed with performance, but equally I don't want to wait hours for the result).

Really interested to hear people's thoughts on this problem!

I think it may be easier if you have first word same in every record that has to have same name for e.g. there are only `Tesco` instead of also having `Tescos` otherwise there may need of looking into similarity. — niraj, Nov 26 '17 at 20:41
Yeah, the first word isn't always the same. Unfortunately, it seems that similarity checking is needed in some way. — user4896331, Nov 26 '17 at 20:53

Alperen · Answer 1 · 2020-11-07T10:16:44.767

At the start, the problem doesn't seem complicated, but it is.

I used string similarity package named fuzzywuzzy to decide which string must be replaced. This package uses Levenshtein Similarity, and I used %90 as the threshold value. Also, the first word of any string is used as comparison string. Here is my code:

import pandas
from fuzzywuzzy import fuzz

# Replaces %90 and more similar strings  
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 90:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count :]
                
    return input_list

df = pandas.read_csv('input.txt') # Read data from csv
result = []
for column in list(df):
    column_values = list(df[column])
    first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
    result.append(func(first_words))
    
new_df = pandas.DataFrame(result).transpose() 
new_df.columns = list(df)

print(new_df)

Output:

     Groceries    Car Luxuries
0  Sainsbury's  Texac   Amazon
1  Sainsbury's  Texac  Firebox
2  Sainsbury's  Shell     Sony
3        Tesco  Shell     Sony
4        Tesco  Texac   Amazon
5        Tesco   Moto   Amazon

UPDATE:

More readable version of func, which produces the same result:

def func(input_list):
    for i in range(len(input_list)):
        for j in range(len(input_list)):
            if i < j and fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[i] = input_list[j] # Keep the last encountered item
                # Use following line to keep the first encountered item
                # input_list[j] = input_list[i]

I'm glad you thought it was complex! As a beginner, I often start with a simple working idea, and then find it gets more and more complex as each of the edge cases have to be dealt with. At the same time, I've sometimes had tons of my code replaced by an experienced programmer because I've just not seen the simple solution. That's why I thought it would be educational to ask people what they made of this problem. I'll try your solution out and get back to you. — user4896331, Nov 27 '17 at 13:21
I have a similar question here https://stackoverflow.com/questions/64704015/nlp-classification-labels-have-many-similarirites-replace-to-only-have-one — Snorrlaxxx, Nov 05 '20 at 20:12
@Snorrlaxxx It has been a long time but, I believe yes, the given code applies to every column. `for column in list(df)` loop gets each column, finds first words in that column and calls `func` for the list of those first words. Then, a new version of the column is appended to `result` list. When the loop ends, the rows of this `result` list correspond to columns in the updated dataframe. The key point here is **using fuzzywuzzy to detect similar string**. — Alperen, Nov 05 '20 at 22:45
Thanks for responding, could you have a look at the question I posed? I believe it is very similar to the OPs question. — Snorrlaxxx, Nov 05 '20 at 23:03

Replace similar strings in a column with the same string

1 Answers1

Linked