How to remove every possible accent from a column in Python

Question

I am new to Python. I have a data frame with a column named 'Name'. The column contains different types of accents, and I am trying to remove them. For example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata


data=pd.read_csv('transactions.csv')

data.head()

nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

I am getting this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
      1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

TypeError: normalize() argument 2 must be unicode, not Series

`print(nm)` and `print(type(nm))` - This will prove that you don't have unicode but a Series object. — JacobIRR, Jun 12 '18 at 17:27
I know that I have series object. But my objective is to remove the accents from the column. Is there any other way to do it? — user3642360, Jun 12 '18 at 17:31
Loop over each element in the series. `print([i for i in nm])` will show you each item so you know how to extract the right value from each item. — JacobIRR, Jun 12 '18 at 17:34

score 0 · Answer 1 · answered Jun 12 '18 at 17:35

The error occurs because normalize requires a single string as its second argument, not a Series of strings. Here is an example I found online that passes a single string:

unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
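
To apply the same call to a whole column, one option is to wrap it in a small function and map it over the Series. This is a minimal sketch, assuming Python 3; the helper name `strip_accents` and the sample values are illustrative and not from the thread:

```python
import unicodedata

import pandas as pd


def strip_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops the combining marks;
    # decoding turns the resulting bytes back into a str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


names = pd.Series(['rubén', 'zuñiga', 'Durrës Åland Islands'])
cleaned = names.map(strip_accents)
print(cleaned.tolist())  # ['ruben', 'zuniga', 'Durres Aland Islands']
```

In Python 3 the trailing `.decode('ascii')` matters: without it each element would be a `bytes` object rather than a string.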

score 0 · Answer 2 · answered Jun 12 '18 at 17:43

Try this for one column:

nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

Try this for multiple columns:

obj_cols = data.select_dtypes(include=['O']).columns
data[obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
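
A small self-contained check of the multi-column approach, using made-up data. The column names and the explicit `text_cols` list are assumptions for illustration; the list is written out by hand so the sketch does not depend on how a given pandas version infers string dtypes. Plain bracket indexing with a list of labels selects columns, and `apply` then receives one column at a time:

```python
import pandas as pd

# Made-up sample data; the column names are illustrative only.
data = pd.DataFrame({
    'Name': ['rubén', 'zuñiga'],
    'City': ['Bogotá', 'Medellín'],
    'Amount': [10, 20],
})

# Hypothetical explicit list of the text columns to clean.
text_cols = ['Name', 'City']

# data[text_cols] selects columns; apply() passes each column (a Series) to the lambda.
data[text_cols] = data[text_cols].apply(
    lambda col: col.str.normalize('NFKD')
                   .str.encode('ascii', errors='ignore')
                   .str.decode('utf-8')
)
print(data)
```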

Cibic

Thanks. But I am getting a wrong value with this code: `nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')`. For example, lizeth muñoz sanchez => lizeth muAoz sanchez – user3642360 Jun 12 '18 at 17:49
It's probably the Unicode form. Maybe try this? `nm.str.lower().str.decode('utf-8').map(lambda x: unicodedata.normalize('NFKD', x)).str.encode('ascii', 'ignore')` – Cibic Jun 12 '18 at 18:42
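
The `muAoz` result is consistent with the text having been decoded with the wrong codec before normalization, although nothing in the thread confirms this. If the UTF-8 bytes of `ñ` are read as Latin-1, they become the two characters `Ã±`, and stripping accents from those leaves a stray `A`. A minimal sketch of that assumption, in Python 3, with a hypothetical `strip_accents` helper:

```python
import unicodedata


def strip_accents(text):
    # Hypothetical helper: decompose, then drop everything that is not ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


original = 'lizeth muñoz sanchez'

# Simulate text that is UTF-8 on disk but was decoded as Latin-1.
misread = original.encode('utf-8').decode('latin-1')

wrong = strip_accents(misread)
right = strip_accents(original)
print(wrong)  # lizeth muAoz sanchez
print(right)  # lizeth munoz sanchez
```

If this is the cause, telling `pd.read_csv` the file's real encoding through its `encoding` parameter before normalizing would be the first thing to try.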

score 0 · Answer 3 · answered May 30 '22 at 10:28

Try this for one column:

df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))

Replace column_name with the name of the column in your data.
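
One thing to watch with `str(x)` is that a missing value is turned into literal text such as 'nan' or 'None'. A possible variant, sketched here with a hypothetical helper name and made-up data, only touches real strings and passes everything else through unchanged:

```python
import unicodedata

import pandas as pd


def strip_accents_or_keep(value):
    # Hypothetical helper: leave non-string values (e.g. missing data) as they are.
    if not isinstance(value, str):
        return value
    return unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')


df = pd.DataFrame({'Name': ['rubén', None, 'zuñiga']})
df['Name'] = df['Name'].apply(strip_accents_or_keep)
print(df)
```

Note that characters with no Unicode decomposition, such as `ø` or `ß`, are not converted to a base letter by this approach; the ASCII encode step simply drops them.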

Manivannan S
