I'm running into the following issue in a small project of mine. I have a large dataset in which some string values are misspelled. My goal is to write a function that loops over the names, finds all names that look fairly similar (a similarity score of at least 0.75), and gives them the same name. The example below is a subset of the data in which "Bob Fisherman", "Bob Felony" and "Bob Haris" are the correct names. I would like the misspelled names to be changed to these whenever they match.
Here is a subset of the dataframe:
columns = ["Name", "Type", "Amount", "Year"]
data = [
    ("Bob fisherman", "Income", 150, 2022),
    ("Bob fisherman", "Income", 100, 2021),
    ("Bob Felony", "Income", 100, 2021),
    ("Bob Felany", "Expense", 50, 2022),
    ("Bob Haris", "Expense", 100, 2022),
    ("Bob Disherman", "Expense", 100, 2021),
]
data = spark.createDataFrame(data).toDF(*columns)
So eventually I would like to have something like this:
Name | Type | Amount | Year |
---|---|---|---|
Bob Fisherman | Income | 150 | 2022 |
Bob Fisherman | Income | 100 | 2021 |
Bob Felony | Income | 100 | 2021 |
Bob Felony | Expense | 50 | 2022 |
Bob Haris | Expense | 100 | 2022 |
Bob Fisherman | Expense | 100 | 2021 |
The example only covers Bob, but the full dataset contains many more names, so a pre-specified list of correct names is unfortunately not going to cut it.
I tried to get some inspiration from the following question, but I couldn't get it to work: Replace similar strings in a column with the same string
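To make the matching rule concrete, here is a minimal plain-Python sketch of what I have in mind (not Spark, so it would only work on the distinct names collected to the driver). It rests on several assumptions: `difflib.SequenceMatcher` as the similarity measure, case-insensitive comparison, and treating the most frequent (then first-seen) spelling in each group as the correct one. The names `similarity` and `build_name_mapping` are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

THRESHOLD = 0.75  # similarity cutoff mentioned in the question


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_name_mapping(names, threshold=THRESHOLD):
    """Map every distinct spelling to one representative spelling per group."""
    counts = Counter(names)
    # Assumption: frequent spellings are more likely correct, so visit them
    # first. sorted() is stable, so ties keep their first-seen order.
    ordered = sorted(counts, key=lambda n: -counts[n])

    representatives = []
    mapping = {}
    for name in ordered:
        # Closest representative found so far, or None if there is none yet.
        best = max(representatives, key=lambda r: similarity(name, r), default=None)
        if best is not None and similarity(name, best) >= threshold:
            mapping[name] = best
        else:
            # Nothing similar enough: this spelling starts a new group.
            representatives.append(name)
            mapping[name] = name
    return mapping


names = [
    "Bob fisherman", "Bob fisherman", "Bob Felony",
    "Bob Felany", "Bob Haris", "Bob Disherman",
]
mapping = build_name_mapping(names)
for original, canonical in mapping.items():
    print(f"{original} -> {canonical}")
```

On this subset, "Bob Felany" ends up grouped with "Bob Felony" and "Bob Disherman" with "Bob fisherman", while "Bob Haris" stays on its own. Note that the sketch can only pick a spelling that actually occurs in the data, so it yields "Bob fisherman" rather than "Bob Fisherman"; fixing the casing would need an extra normalization step. In PySpark, such a dict could presumably be applied with `data.replace(mapping, subset=["Name"])`, but I don't know how to do the grouping itself at scale.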