I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
print repr(row[0])
This works fine, and prints out what I expect it to print out; a UTF8 encoded str
:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore when I simply print the str
(as opposed to repr()
) the output displays ok (which I don't understand eitherway - shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
but when I try to convert my UTF8 encoded strs
to unicode
:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
print unicode(name, 'utf-8') # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
File "scripts/script.py", line 33, in <module>
print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
name = row[0]
unicode_name = unicode(name, 'utf-8')
print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused as these seem to be mangled hex values. I've read this question:
and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8, but when I initially print out the repr
values of the cells, they appear to to correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)
As an aside, I believe I could use codecs
to open the file and read it directly into unicode objects, but the csv
module doesn't support unicode natively so I can use this approach.