Remove accents while comparing excel files using python

Question

i want to compare two excel files and delete rows with common names in the second file. the issue is that one list doesn't have accents while one does. i want to remove the accents in order to compare the names, delete from the second file and then merge the two files. thanks!

code so far

import pandas as pd
from unidecode import unidecode

df0 = pd.ExcelFile('Players.xlsx').parse('Sheet1')
df1 = pd.ExcelFile('Players.xlsx').parse('Sheet1')

result = df0[df0[(unidecode('Name'))].isin(df1[unidecode('Name')])]
result.to_excel('Merged.xlsx', index=False)

Does this answer your question? [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — Corentin Limier, Feb 03 '20 at 14:27
not really because it seems like unidecode only works with a string — J. Doe, Feb 03 '20 at 14:46

arnaud · Accepted Answer · 2020-02-03T15:26:31.227

1

As pointed in the comments by Corentin, you want to remove accents from your second excel file.

from unidecode import unidecode
df1["Name"] = df1["Name"].apply(unidecode)  # strip accents from Name column
result = df1[~df1["Name"].str.isin(df0["Name"].unique())]  # keep only if name not in df0

Please note that:
- The ~ operator is equivalent to a NOT boolean operator.
- using df0.Name.unique() enables to lookup only on unique (i.e. distinct) name values in df0.

edited Feb 03 '20 at 15:26

answered Feb 03 '20 at 15:04

arnaud

3,293
1
10
27

i keep getting the error ```AttributeError: 'DataFrame' object has no attribute 'Name'``` even if i try to define the name column of df1 – J. Doe Feb 03 '20 at 15:21
Oh, you should check your column names then. Is it really `Name` for both? Check by using `print(df0.columns, df1.columns)`. I guess the logic is right, just need to make sure your columns are properly named. – arnaud Feb 03 '20 at 15:25
the columns are actually Player Names, which would obviously give syntax errors for the result line due to the space. do you know the correct way to write it out? – J. Doe Feb 03 '20 at 15:28
Yea sure, avoid using `df.ColumnName` but instead use `df["ColumnName"]` which won't get you into troubles when there is a space. Hence, use `df["Player Names"]`. Also edited my answer. – arnaud Feb 03 '20 at 15:38

score 0 · Answer 2 · answered Feb 03 '20 at 15:05

You can normalize strings with accents using unicodedata.normalize

Code:

import unicodedata
my_str = 'àbćdë'
str(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'), 'utf-8')

Output:

'abcde'

To apply to a DataFrame use:

def normalize(s):
    return str(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore'), 'utf-8')

df1.Name.apply(normalize)

Remove accents while comparing excel files using python

2 Answers2