How to reformat strings to not include accented letters in Python?

Question

I'm trying to make a list of locations from a column of a csv file in Python.

This is one entry in the column:

Rio Balira del Orien,Riu Valira d'Orient,Riu Valira d’Orient,Río Balira del Orien

This is the corresponding list in its current state:

locs = ['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira d\xe2\x80\x99Orient', 'R\xc3\xado Balira del Orien']

In my program, I need to check whether a given word is in the list, so I'm trying to get rid of the escaped byte sequences for accented letters, apostrophes, etc. (e.g. \xc3\xad is í) and have each location in plain lowercase ASCII. When I try the code

loclist = [x.encode('ascii').lower() for x in locs]

it throws the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

What command should I use instead?

Thanks!

Padraic Cunningham · Answer 1 · 2014-06-23T16:33:41.870

1

locs = ['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira d\xe2\x80\x99Orient', 'R\xc3\xado Balira del Orien']

To remove the non-ASCII characters completely:

print [unicode(x,errors="ignore") for x in locs]

[u'Rio Balira del Orien', u"Riu Valira d'Orient", u'Riu Valira dOrient', u'Ro Balira del Orien']

To encode to ASCII, keeping the base letter of each accented character:

import unicodedata
print [unicodedata.normalize('NFD', x.decode('utf-8')).encode('ascii', 'ignore') for x in locs]

['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira dOrient', 'Rio Balira del Orien']
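Since the goal in the question is a lowercase ASCII list to test membership against, the normalize-then-encode step above can be wrapped in a small function. This is only a sketch: `to_ascii_lower` is a made-up helper name, the input is assumed to be UTF-8 encoded bytes as in `locs`, and the literals carry a `b` prefix so the snippet also runs on Python 3.

```python
import unicodedata

# Sample data from the question: UTF-8 encoded byte strings.
locs = [b'Rio Balira del Orien', b"Riu Valira d'Orient",
        b'Riu Valira d\xe2\x80\x99Orient', b'R\xc3\xado Balira del Orien']


def to_ascii_lower(raw):
    """Hypothetical helper: UTF-8 bytes -> lowercase ASCII text."""
    text = raw.decode('utf-8')
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    # 'ignore' drops everything that is not ASCII, including the marks.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    return ascii_bytes.decode('ascii').lower()


loclist = [to_ascii_lower(x) for x in locs]

# Normalize the search term the same way before testing membership.
found = to_ascii_lower(b'R\xc3\xado Balira del Orien') in loclist
```

Note that the curly apostrophe has no ASCII decomposition, so it is dropped rather than turned into a straight one: "d'Orient" and "dOrient" remain two different entries.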

edited Jun 23 '14 at 16:33

answered Jun 23 '14 at 15:28


I guess this works, if the OP is ok with losing the accented characters altogether. – dano Jun 23 '14 at 15:33

score 0 · Answer 2 · answered Jun 23 '14 at 15:29

You can't encode accented characters as ASCII; you have to use an encoding that supports a larger character set. Right now you have a list of UTF-8 encoded byte strings, which is a reasonable way to store them. You could decode them to unicode objects instead, which is good practice:

>>> [l.decode('utf-8') for l in locs]
[u'Rio Balira del Orien', u"Riu Valira d'Orient", u'Riu Valira d\u2019Orient', u'R\xedo Balira del Orien']

You would just need to make sure you re-encode the strings before doing anything that requires an encoded string, such as writing them to disk. You can do that by calling encode('utf-8') on the unicode object.
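A rough sketch of that decode-early, encode-late round trip, assuming the same `locs` data as in the question (written with `b` prefixes and `u` literals so it runs on both Python 2 and Python 3); the variable names are only illustrative:

```python
# UTF-8 encoded byte strings, as read from the CSV file.
locs = [b'Rio Balira del Orien', b"Riu Valira d'Orient",
        b'Riu Valira d\xe2\x80\x99Orient', b'R\xc3\xado Balira del Orien']

# Decode once, right after reading the data.
decoded = [raw.decode('utf-8') for raw in locs]

# Work with unicode objects inside the program; accents are preserved.
found = u'R\xedo Balira del Orien' in decoded

# Encode again only at the boundary, e.g. just before writing to disk.
encoded = [text.encode('utf-8') for text in decoded]
```

With this approach nothing is lost: `encoded` holds the same bytes as `locs`, and the membership test matches the accented spelling exactly.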
