Dataframe text column with spelling mistakes

Question

Sorry for bad title, I wasn't sure how best to describe the issue.

I have a dataframe with a column for occupation, df['occupation'], users can enter their occupuation during signup using whatever terms they like.

I'm attempting to do an EDA on the column, however, I'm not sure how to clean the column to get it from this,

Occupation
a-level student
a level
alavls
university physics student
physics student
6th form student
builder

Into something like this,

Occupation
a-levels
University student
Full time employment

Without writing out hundreds of lines renaming each unique entry.

TYIA

Any help or links to useful modules would be great.

How many rows or entries do you have? Do you have a list of authoritative occupations like 'university student' and so on? If there are not too many rows, the quickest and most accurate thing may be to just examine them manually. — Cam, Apr 20 '22 at 14:29
If you can't examine and fix them manually, this is going to be a Very Hard problem. — RoadieRich, Apr 20 '22 at 14:31
I have over 300 unique values for around 250000 rows.... going to guess it'll be a very hard problem. — Databoi, Apr 20 '22 at 14:34

Yehdhih ANNA · Accepted Answer · 2022-04-21T11:49:58.850

The simplest way to do this is by applying a function that measures the similarity between the two sentences, there are plenty of similiraty mesures that could be used in this context like the hamming distance, however they are all relatively very limited, and you might be forced at some point -if in production- to have a machine learning model for this task.

import pandas as pd

def hamming_distance(chaine1, chaine2):
    """mesures the similiraty between two chaines of caracters
    Note: this is a very limited as it only examines the positions of the letters in both chaines.
    """
    return sum(c1 != c2 for c1, c2 in zip(chaine1, chaine2))

OCCUPATIONS = [ "Occupation","a-levels" , "University student" , "Full time employment"]
def get_most_similar(ocup,OCCUPATIONS):
    """return the most similar occupation from the unique values OCCUPATIONs to the entry ocup
    """
    return min([(oc,hamming_distance(ocup.lower(),oc.lower())) for oc in OCCUPATIONS],key=lambda item:item[1])[0]

column = ["Occupation","a-level student","a level","alavls","university physics student","physics student","6th form student","builder"]
df = pd.DataFrame(column,columns=['occupation'])  # this is just a reconstruction of your dataframe you probably don't need this line.

df['occupation']=df['occupation'].apply(lambda ocup : get_most_similar(ocup,OCCUPATIONS))
df.head(100)

This is a really nice bit of code and it did the job, thank you!!! — Databoi, Apr 21 '22 at 09:31

score 0 · Answer 2 · answered Apr 20 '22 at 15:10

0

Hi the approach you can possibly use in this problem is similar to the Solution Covered here Using apply map with Regex

The Regex approach will allow you to use Wildcards for cases you have not explored in your Dataset.

answered Apr 20 '22 at 15:10

theseeker

1,281
1
7
11

Dataframe text column with spelling mistakes

2 Answers2