
I am new to Python. I have a DataFrame with a column named 'Name'. The column contains various accented characters, and I am trying to remove the accents. For example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata


data=pd.read_csv('transactions.csv')

data.head()

nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

I am getting this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
      1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')

TypeError: normalize() argument 2 must be unicode, not Series
cottontail
user3642360
  • `print(nm)` and `print(type(nm))` - This will prove that you don't have unicode but a Series object. – JacobIRR Jun 12 '18 at 17:27
  • I know that I have a Series object. But my objective is to remove the accents from the column. Is there any other way to do it? – user3642360 Jun 12 '18 at 17:31
  • Loop over each element in the series. `print([i for i in nm])` will show you each item so you know how to extract the right value from each item. – JacobIRR Jun 12 '18 at 17:34

3 Answers


You get that error because unicodedata.normalize requires a single string as its second argument, not a pandas Series of strings. Here is an example I found online that calls it on one string:

unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
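
Since normalize works on one string at a time, one way to cover a whole column is to wrap the call in a small function and map it over the Series. The following is a minimal sketch, assuming Python 3 and string-only values; strip_accents is a hypothetical helper name and the sample names are made up.

```python
import unicodedata

import pandas as pd


def strip_accents(text):
    """Return text with accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors ignored drops the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Made-up sample data standing in for the 'Name' column.
names = pd.Series(["rubén", "zuñiga", "Durrës Åland Islands"])
cleaned = names.map(strip_accents)
print(cleaned.tolist())  # ['ruben', 'zuniga', 'Durres Aland Islands']
```

Note that characters with no ASCII base letter (for example "ß" or "ø") are dropped entirely by this approach rather than transliterated.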
mmghu

Try this for one column:

nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')

Try this for multiple columns:

obj_cols = data.select_dtypes(include=['O']).columns
data[obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
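
To see the multi-column version end to end, here is a small sketch on made-up data, assuming Python 3; the column names and values are invented for illustration. Note that data[obj_cols] selects columns, whereas data.loc[obj_cols] would look the names up as row labels.

```python
import pandas as pd

# Made-up data; the column names are assumptions for illustration only.
data = pd.DataFrame({
    "Name": pd.Series(["rubén", "lizeth muñoz sanchez"], dtype=object),
    "City": pd.Series(["Bogotá", "Medellín"], dtype=object),
    "Amount": [10, 20],
})

# Pick only the object (string) columns; numeric columns are left alone.
obj_cols = data.select_dtypes(include=["O"]).columns

# Decompose accents, drop the non-ASCII marks, then decode bytes back to str.
data[obj_cols] = data[obj_cols].apply(
    lambda col: col.str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("utf-8")
)
print(data)
```

The numeric Amount column is untouched because it is not selected by select_dtypes.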
Cibic
  • Thanks, but I am getting a wrong value with this code: `nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')`. For example, lizeth muñoz sanchez => lizeth muAoz sanchez – user3642360 Jun 12 '18 at 17:49
  • It's probably the Unicode form. Maybe try this? `nm.str.lower().str.decode('utf-8').map(lambda x: unicodedata.normalize('NFKD', x)).str.encode('ascii', 'ignore')` – Cibic Jun 12 '18 at 18:42
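
One plausible explanation for the "muAoz" result, offered as an assumption rather than a confirmed diagnosis, is that the UTF-8 bytes of "ñ" were interpreted as Latin-1 somewhere before normalization. The sketch below (Python 3; to_ascii is a hypothetical helper) reproduces that exact output by simulating the wrong decoding.

```python
import unicodedata


def to_ascii(text):
    """Strip accents via NFKD and an ASCII round trip (hypothetical helper)."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


original = "lizeth muñoz sanchez"
# Simulate UTF-8 bytes being decoded as Latin-1 (an assumed cause, not confirmed).
# The two bytes of "ñ" become the two characters "Ã" and "±".
garbled = original.encode("utf-8").decode("latin-1")

print(to_ascii(garbled))   # lizeth muAoz sanchez
print(to_ascii(original))  # lizeth munoz sanchez
```

If this is what is happening, making sure the text is decoded as UTF-8 before normalizing (for example by passing the file's actual encoding through the encoding parameter of pd.read_csv) should give the expected result.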

Try this for one column:

df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))

Replace column_name with the name of the column you want to clean.
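
One thing to watch with str(x): missing values are converted too, so a NaN would become the text 'nan'. A possible variation, sketched here under the assumption of Python 3 and with a hypothetical helper name and made-up data, passes non-string values through unchanged.

```python
import unicodedata

import pandas as pd


def strip_accents_or_keep(value):
    """Strip accents from strings; return anything else unchanged (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # leave None/NaN and other non-string values alone
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("utf-8")


# Made-up data with a missing entry.
df = pd.DataFrame({"Name": pd.Series(["rubén", None, "zuñiga"], dtype=object)})
df["Name"] = df["Name"].apply(strip_accents_or_keep)
print(df)
```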