I'm trying to make a list of locations from a column of a csv file in Python.

This is one entry in the column:

Rio Balira del Orien,Riu Valira d'Orient,Riu Valira d’Orient,Río Balira del Orien

This is the corresponding list in its current state:

locs = ['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira d\xe2\x80\x99Orient', 'R\xc3\xado Balira del Orien']

In my program, I need to check whether a given word is in the list, so I'm trying to get rid of the escaped byte sequences (e.g. \xc3\xad = í) for accented letters, apostrophes, etc. and just have each location in simple lowercase ASCII. When I try to use the code

loclist = [x.encode('ascii').lower() for x in locs]

it throws the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

What command should I use instead?

Thanks!

dano
user3753722

2 Answers

locs = ['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira d\xe2\x80\x99Orient', 'R\xc3\xado Balira del Orien']

To drop the non-ASCII bytes completely:

print [unicode(x,errors="ignore") for x in locs]

[u'Rio Balira del Orien', u"Riu Valira d'Orient", u'Riu Valira dOrient', u'Ro Balira del Orien']

To encode to ASCII, keeping the base letter of each accented character:

import unicodedata
print [unicodedata.normalize('NFD', x.decode('utf-8')).encode('ascii', 'ignore') for x in locs]

['Rio Balira del Orien', "Riu Valira d'Orient", 'Riu Valira dOrient', 'Rio Balira del Orien']
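
Since the goal is a lowercase ASCII list for membership checks, the same idea could be wrapped in a small helper. This is only a sketch: `to_ascii_lower` is a made-up name, the byte strings are written with an explicit `b` prefix so the snippet also runs on Python 3, and mapping the curly apostrophe to a plain one is my assumption about what is wanted (otherwise it is simply dropped, as above).

```python
import unicodedata

# The question's list, written as explicit UTF-8 byte strings.
locs = [b'Rio Balira del Orien', b"Riu Valira d'Orient",
        b'Riu Valira d\xe2\x80\x99Orient', b'R\xc3\xado Balira del Orien']

def to_ascii_lower(raw):
    """Hypothetical helper: UTF-8 bytes -> lowercase ASCII-only text."""
    text = raw.decode('utf-8')
    # Assumption: treat the curly apostrophe (U+2019) as a plain apostrophe
    # so it survives the ASCII encoding step instead of being dropped.
    text = text.replace(u'\u2019', u"'")
    # NFD splits an accented letter into base letter + combining mark;
    # encoding with 'ignore' then discards the combining marks.
    folded = unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
    return folded.decode('ascii').lower()

loclist = [to_ascii_lower(x) for x in locs]
print(loclist)
print('rio balira del orien' in loclist)
```

Note that a word being looked up has to go through the same folding before the `in` test, otherwise accented input will never match.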
Padraic Cunningham
  • I guess this works, if the OP is ok with losing the accented characters altogether. – dano Jun 23 '14 at 15:33

You can't encode accented characters as ASCII; you need an encoding that supports a larger character set. (The error is a UnicodeDecodeError rather than an encode error because, in Python 2, calling encode() on a byte string first decodes it implicitly with the ascii codec.) Right now you have a list of UTF-8 encoded byte strings, which is a reasonable way to store them. Alternatively, you could decode them to unicode objects, which is good practice:

>>> [l.decode('utf-8') for l in locs]
[u'Rio Balira del Orien', u"Riu Valira d'Orient", u'Riu Valira d\u2019Orient', u'R\xedo Balira del Orien']

Just make sure you re-encode the strings before doing things like writing them to disk, which require encoded bytes. You can do that by calling encode('utf-8') on the unicode object.
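
A rough sketch of that round trip, if it helps: decode once on the way in, do the lowercase membership check on the decoded text, and encode again only on the way out. The names `decoded`, `lowered` and `encoded` are just for illustration, and the `b`/`u` prefixes are there so the snippet runs on both Python 2.7 and Python 3.

```python
# The question's list, written as explicit UTF-8 byte strings.
locs = [b'Rio Balira del Orien', b"Riu Valira d'Orient",
        b'Riu Valira d\xe2\x80\x99Orient', b'R\xc3\xado Balira del Orien']

# Decode once at the input boundary; from here on everything is text.
decoded = [raw.decode('utf-8') for raw in locs]

# Case-insensitive lookup that keeps the accented characters intact.
lowered = [name.lower() for name in decoded]
print(u'r\xedo balira del orien' in lowered)

# Re-encode only at the output boundary (e.g. just before writing to disk).
encoded = [name.encode('utf-8') for name in decoded]
print(encoded == locs)
```

The word being searched for has to be decoded text as well; comparing raw UTF-8 bytes against decoded strings will not match.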

dano