Replacing similar strings in the column by using the same for both

Question

I'm encountering the following issue during a small project of mine. I'm having a large dataset where some string values are accidentally not written properly. My goal is to write a function that ensures that all names that look fairly similar (.75) will be looked for in a loop, and will get the same name. In the example below I described a subset of the data where "Bob Fisherman", "Bob Felony" & "Bob Haris" are the correct names. I would like to have the misspelled names changes to the above if they match

Here is a subset of the dataframe:

columns = ["Name", "Type","Amount", "Year"]
data = [("Bob fisherman", "Income", 150, 2022), ("Bob fisherman","Income", 100, 2021), ("Bob Felony", "Income", 100, 2021), ("Bob Felany", "Expense", 50, 2022), ("Bob Haris", "Expense", 100, 2022), ("Bob Disherman", "Expense", 100, 2021)]
data = spark.createDataFrame(data).toDF(*columns)

So eventually I would like to have something like this:

Name	Type	Amount	Year
Bob Fisherman	Income	150	2022
Bob Fisherman	Income	100	2021
Bob Felony	Income	100	2021
Bob Felany	Income	50	2022
Bob Haris	Income	100	2022
Bob Felony	Income	100	2021
Bob Fisherman	Income	100	2022

In the example it only goes about Bob. But in the total sample, I have much more names so the use of pre-specified list is not going to cut it unfortunately.

I tried to get some inspiration from the following question but I didn't seem to make it work: Replace similar strings in a column with the same string

Laurent · Accepted Answer · 2022-10-02T18:59:10.020

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    [
        ("Bob Fisherman", "Income", 150, 2022),
        ("Bob Heris", "Income", 100, 2021),
        ("Bob Felony", "Income", 100, 2021),
        ("Bob Felany", "Expense", 50, 2022),
        ("Bob Haris", "Expense", 100, 2022),
        ("Bob Disherman", "Expense", 100, 2021),
    ],
    columns=["Name", "Type", "Amount", "Year"],
)
print(df)
# Output
            Name     Type  Amount  Year
0  Bob Fisherman   Income     150  2022
1      Bob Heris   Income     100  2021
2     Bob Felony   Income     100  2021
3     Bob Felany  Expense      50  2022
4      Bob Haris  Expense     100  2022
5  Bob Disherman  Expense     100  2021

Here is one way to do it with the help of SequenceMatcher class from Python standard library difflib module:

from difflib import SequenceMatcher

def similar(a, b):
    """Get similarity ratio between a and b.

    Args:
        a: value.
        b: other value.

    Returns:
        Similatity ratio.

    """
    return SequenceMatcher(None, a, b).ratio()


def replace_similar_values(df, col, value):
    """Main helper function.

    Args:
        df: target dataframe.
        col: target column.
        value: value to use for replacement.

    Returns:
        Modified dataframe.

    """
    df = df.assign(
        Match=df[col].map(
            lambda x: max(
                [similar(x, value)],
                key=lambda x: x if x != 1 else 0,
            )
        )
    )
    df["Name"] = df.apply(lambda x: value if x["Match"] >= 0.8 else x["Name"], axis=1)
    df = df.drop(columns="Match")
    return df

And then:

correct_names = ["Bob Fisherman", "Bob Felony", "Bob Haris"]

for correct_name in correct_names:
    df = replace_similar_values(df, "Name", correct_name)

print(df)
# Output
            Name     Type  Amount  Year
0  Bob Fisherman   Income     150  2022
1      Bob Haris   Income     100  2021
2     Bob Felony   Income     100  2021
3     Bob Felony  Expense      50  2022
4      Bob Haris  Expense     100  2022
5  Bob Fisherman  Expense     100  2021

Replacing similar strings in the column by using the same for both

1 Answers1