How to correct the sates names using dictionary in python

Question

I currently have a csv file. The data originally is derived from PDF and doing a further analysis on the data, There is an [States] column where some the values are spelled wrong, i.e it contains wrong spelling or integers.

I need instead of numbers or the wrong spelling in the rows, It can parse First 3 characters or last 3 Characters and match it with the dictionary Firstly it needs to match with first 3 character if it is not matching then look for the last 3 character and gives the correct values to be saved into it.

Such as in the following example:

state_name
Assan
Andhra Prade5h
M1zoram
Uttar Pr8desh

Expected Output:

state_name
Assam
Andhra Pradesh
Mizoram
Uttar Pradesh

What I have tried so far:

dict = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'} 

##Didn't worked
df['state_name'] = df['state_name'].map(dict).fillna(df['state_name'])

##Then Tried Using the below method but not able to perform the further process

state1 = df['state_name'].str[3:]
state2 = df['state_name'].str[:3]

So I have no clue how to handle this problem.

Is your goal is to find the most likely matching state name? — C14L, Mar 24 '21 at 16:43
@C14L - Yes exactly, I want to match most likely names from dictionary after parsing — Manz, Mar 24 '21 at 16:47
Why do this requirement on the first 3 and last three, you'd be a lot safer doing a partial ratio — ALollz, Mar 24 '21 at 16:53

score 2 · Accepted Answer · answered Mar 24 '21 at 16:53

2

Here is a solution. Basically it compares your input with your correct name list and return best match(the most similar):

from difflib import SequenceMatcher

lst = ['Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'] #Correct Name List

def closest(s):
    highest = 0
    result = ''
    for i in lst:
        temp = SequenceMatcher(None, s, i).ratio() #Similarity Ratio
        if temp > highest:
            highest = temp
            result = i
    return result


df = pd.DataFrame(['Assan','Andhra Prade5h','M1zoram','Uttar Pr8desh'], columns = ["state_name"])

#Output Before
    state_name
0   Assan
1   Andhra Prade5h
2   M1zoram
3   Uttar Pr8desh


df['state_name'] = df['state_name'].apply(lambda x: closest(x))

#Output After
    state_name
0   Assam
1   Andhra Pradesh
2   Mizoram
3   Uttar Pradesh

answered Mar 24 '21 at 16:53

Inputvector

1,061
10
22

Error while installing the library – Manz Mar 24 '21 at 19:41
What error do you get? It should be already installed – Inputvector Mar 24 '21 at 20:09
Could not fetch URL https://pypi.org/simple/cdifflib/: There was a problem confirming the ssl certificate: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/cdifflib/ (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)'),)) – Manz Mar 24 '21 at 20:10
1

Did you tried to import ```difflib``` directly? Because it is a default library. You are trying to install ```cdifflib``` Also, Check this [question](https://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi) – Inputvector Mar 24 '21 at 20:17
What would be method if the Input value is case sensitive. – Manz Mar 26 '21 at 09:08
You can convert the input lowercase like ```.lower()``` so your apply function looks like: ```df['state_name'] = df['state_name'].apply(lambda x: closest(x.lower()))``` Note that you need to convert your correct list to lower as well. – Inputvector Mar 26 '21 at 16:37

score 2 · Answer 2 · answered Mar 24 '21 at 17:26

here is one with product and Levenshtein , you can test to see how it performs on the original df:

import Levenshtein
d = {'Assam','Andhra Pradesh', 'Mizoram', 'Uttar Pradesh'}

i = pd.MultiIndex.from_product((df['state_name'],d))
def f(x) : return Levenshtein.ratio(*x)
out = pd.DataFrame.from_records(i,columns=['Inp','Output']).assign(Score=i.map(f))
out = out.loc[out.groupby('Inp')['Score'].idxmax()]

print(out)
               Inp          Output     Score
7   Andhra Prade5h  Andhra Pradesh  0.928571
2            Assan           Assam  0.800000
9          M1zoram         Mizoram  0.857143
12   Uttar Pr8desh   Uttar Pradesh  0.923077

Library is not getting installed – Manz Mar 24 '21 at 19:44 — Manz, Mar 24 '21 at 19:44
@Manish run pip install Levenshtein on your shell of python – anky Mar 24 '21 at 20:16 — anky, Mar 24 '21 at 20:16

ALollz · Answer 3 · 2021-03-24T17:10:44.263

You can use a partial_ratio to find the closest match from your list of matches. Because this will be slow, you can first check for exact matches and then correct only the others. The output contains the original value, what was matched and then if it was a fuzzy match the match ratio, which you can use to determine the quality of the match.

Setup

import pandas as pd
from itertools import zip_longest
from fuzzywuzzy import fuzz

df = pd.DataFrame({'state_name': ['Assan', 'Andhra Prade5h', 'M1zoram', 'Uttar Pr8desh',
                                  'Assam']})
matches = ['Assam', 'Andhra Pradesh', 'Mizoram', 'Uttar Pradesh']

Code

# Exact Matches
df['match'] = df['state_name'].where(df['state_name'].isin(matches))

# Split to figure out what didn't match exactly
m = df['match'].notnull()
df_corr = df[m].copy()   # These are fine, exact matches
df_inco = df[~m].copy()  # Need to fix

# Only if something needs to be fixed.
if not df_inco.empty:
    l = []
    for idx, row in df_inco.iterrows():
        z = zip_longest([row.state_name], matches, fillvalue=row.state_name)
        s = pd.DataFrame([(x[1], fuzz.ratio(x[0], x[1])) for x in z], 
                         columns=['match', 'fuzz_ratio'])
        s = s.loc[[s['fuzz_ratio'].idxmax()]]  # Best Match
        s.index = [idx] # Set index so we can just combine first after the loop.
        l.append(s)

    # Add the best match based on the fuzzy ratio
    df_inco = df_inco.combine_first(pd.concat(l)) 

    df = pd.concat([df_inco, df_corr], sort=False).sort_index()

Output

print(df)

   fuzz_ratio           match      state_name
0        80.0           Assam           Assan
1        93.0  Andhra Pradesh  Andhra Prade5h
2        86.0         Mizoram         M1zoram
3        92.0   Uttar Pradesh   Uttar Pr8desh
4         NaN           Assam           Assam

Error while installing the library – Manz Mar 24 '21 at 19:44 — Manz, Mar 24 '21 at 19:44

How to correct the sates names using dictionary in python

3 Answers3

Setup

Code

Output