
I've been reading all the questions about converting from Unicode to CSV in Python here on Stack Overflow and I'm still lost. Every time I get a "UnicodeEncodeError: 'ascii' codec can't encode character u'\xd1' in position 12: ordinal not in range(128)". This is my code:

import csv
import cStringIO

buffer = cStringIO.StringIO()
writer = csv.writer(buffer, csv.excel)
cr.execute(query, query_param)
while True:
    row = cr.fetchone()
    if not row:
        break
    # this is where the error above is raised
    writer.writerow([s.encode('ascii', 'ignore') for s in row])

The value of row is

(56, u"LIMPIADOR BA\xd1O 1'5 L")

where the value of \xd1 in the database is ñ, an n with a tilde used in Spanish. At first I tried to convert the value to something valid in ASCII, but after losing so much time I'm now just trying to ignore those characters (I suppose I'd have the same problem with accented vowels).

I'd like to save the value to the CSV, preferably with the ñ ("LIMPIADOR BAÑO 1'5 L"), but if that's not possible, at least be able to save it without it ("LIMPIADOR BAO 1'5 L").
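For what it's worth, the behaviour is reproducible with just the string literal (a minimal Python 2 sketch; the output comments assume a UTF-8 terminal, and the exact traceback wording may vary slightly between 2.x versions):

# -*- coding: utf-8 -*-
# Python 2: reproducing the problem with the literal from above
s = u"LIMPIADOR BA\xd1O 1'5 L"

try:
    s.encode('ascii')                  # \xd1 (Ñ) is outside ASCII, so this raises
except UnicodeEncodeError as exc:
    print exc                          # ... ordinal not in range(128) (position 12)

print s.encode('ascii', 'ignore')      # LIMPIADOR BAO 1'5 L  (Ñ silently dropped)
print s.encode('utf-8')                # LIMPIADOR BAÑO 1'5 L (Ñ kept, as two UTF-8 bytes)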

Sergi
  • Updated with question at the end. – Sergi Jan 10 '11 at 19:37
  • Why don't you try encoding to your local Windows 'ANSI' codepage? I'm guessing you are using Windows because CSV is most commonly used on Windows, but please ignore me if this is wide of the mark. On a *NIX OS I guess one of the 8-bit ISO encodings would be appropriate, but I'm no expert. – David Heffernan Jan 10 '11 at 19:43

1 Answer


Correct, ñ is not a valid ASCII character, so you can't encode it to ASCII. You can, as your code above does, ignore such characters. Another option is to remove the accents; see: What is the best way to remove accents in a Python unicode string?
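A minimal sketch of that accent-stripping approach, using the standard library's unicodedata module (the helper name here is just illustrative):

# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(text):
    # Decompose accented characters (Ñ -> N + combining tilde),
    # then drop the combining marks by encoding to ASCII with 'ignore'.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore')

print strip_accents(u"LIMPIADOR BA\xd1O 1'5 L")   # LIMPIADOR BANO 1'5 L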

But note that both techniques can have bad effects, like making words mean something different. So it's best to keep the accents, and for that you can't use ASCII, but you can use another encoding. UTF-8 is the safe bet. Latin-1 (ISO-8859-1) is a common one, but it covers only Western European characters. CP-1252 is common on Windows, and so on.

So just switch "ascii" for whatever encoding you want.
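For example, the string from the question encodes without complaint under any of those encodings (a quick sketch; which one to pick depends on what will read the CSV later):

s = u"LIMPIADOR BA\xd1O 1'5 L"

print repr(s.encode('utf-8'))     # \xd1 becomes the two bytes \xc3\x91
print repr(s.encode('latin-1'))   # \xd1 becomes the single byte \xd1
print repr(s.encode('cp1252'))    # same single byte as latin-1 for this character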


Your actual code, according to your comment, is:

writer.writerow([s.encode('utf8') if type(s) is unicode else s for s in row]) 

where

row = (56, u"LIMPIADOR BA\xd1O 1'5 L")

Now, I believe that should work, but apparently it doesn't. I suspect unicode gets passed into the csv writer by mistake anyway. Unwrap that long line into its parts:

col1, col2 = row # Use the names of what is actually there instead
row = col1, col2.encode('utf8')
writer.writerow(row) 

Now your real error will no longer be hidden by the fact that you crammed everything onto one line. This could probably also have been avoided if you had included a proper traceback.
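Putting the pieces together, the whole fetch-and-write loop could look something like this. This is only a sketch: it assumes cr, query and query_param are the DB-API cursor and query from your question, and it encodes only the unicode columns to UTF-8, passing everything else (like the integer id) through untouched:

import csv
import cStringIO

output = cStringIO.StringIO()
writer = csv.writer(output, csv.excel)

cr.execute(query, query_param)            # cursor and query as in the question
while True:
    row = cr.fetchone()
    if row is None:                       # no more rows
        break
    writer.writerow([col.encode('utf-8') if isinstance(col, unicode) else col
                     for col in row])

csv_data = output.getvalue()              # the finished CSV, as UTF-8 bytes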

Lennart Regebro
  • Most CSV readers can't handle UTF-8. CSV is most often read on Windows and so one of the so called ANSI encodings would seem most appropriate. – David Heffernan Jan 10 '11 at 19:41
  • Well, I don't know about "most", but some at least. And CSV is used everywhere. Obviously an encoding that can be read by the target software needs to be used. – Lennart Regebro Jan 10 '11 at 19:43
  • The thing is that the above code, using the unicode string u"LIMPIADOR BA\xd1O 1'5 L", fails miserably with the UnicodeEncodeError detailed above (the "ignore" flag is not working, I don't know why). In ideal conditions I'd like to get a CSV file with the full string, including the ñ. – Sergi Jan 10 '11 at 21:53
  • @Sergi: Well, it should work, I'd have to actually debug your code to know what is happening. But in any case: Use a better encoding than ascii. – Lennart Regebro Jan 10 '11 at 21:59
  • @Lennart: ok, how should I export to CSV them? The "encode" is failing even using utf-8. – Sergi Jan 10 '11 at 22:04
  • @Sergi: Then there is something else wrong. `u"LIMPIADOR BA\xd1O 1'5 L".encode('utf8')` works just fine here. As does `u"LIMPIADOR BA\xd1O 1'5 L".encode('ascii', 'ignore')`. So there is something else that is wrong. Note that your claim that row is `(56, u"LIMPIADOR BA\xd1O 1'5 L")` can not be true, as you would get `'int' object has no attribute 'encode'`, but that's not the error you claim you have. So something in your description of the problem is incorrect. – Lennart Regebro Jan 10 '11 at 22:21
  • @Lennart, sorry for the delay, your comment was hidden. You are right, the code is ok, and the error was elsewhere. In fact the problem is that I wrote `writer.writerow([s.encode('utf8') if type(s) is str else s for s in row])` to encode only strings, when in fact I had to check for unicode with `if type(s) is unicode`. Points awarded. – Sergi Jan 11 '11 at 17:17
  • OK, that's a bit strange, but I believe your problem is that you are passing unicode to writerow(). Don't stick everything into one line of code, there is no point. :) – Lennart Regebro Jan 11 '11 at 17:19
  • @Lennart, you're right, but the row is a tuple coming from the query via fetchone, so I did it in one line to avoid copying it to a list while re-encoding and then passing the values to writerow (btw, Python newbie ;) – Sergi Jan 11 '11 at 18:01