Reliable way of handling non-ASCII characters in Python?

Question

I have a column in a spreadsheet whose header contains non-ASCII characters, thus:

'ï»¿Campaign'

If I pop this string into the interpreter, I get:

'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

The string is one of the keys in the rows of a csv.DictReader()

When I try to populate a new dict with the value of this key:

spends['ï»¿Campaign'] = 2

I get:

KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'

If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'

Obviously then I can just update my program to access this key thus:

spends['\xef\xbb\xbfCampaign']

But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any and all non-ASCII characters that may arise?

try `spends[u'æ'] = 2` and see [this similar question for more](https://stackoverflow.com/questions/16437245/python-2-7-unicode-dict) — LinkBerest, Jul 07 '15 at 18:36

score 5 · Answer 1 · edited May 23 '17 at 12:15

Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark (BOM), which is often prepended to text files to indicate that they're encoded using UTF-8. You should strip these bytes. See the related question "Removing BOM from gzip'ed CSV in Python".
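As a minimal sketch of stripping those bytes by hand, assuming the header is still a raw bytestring: `strip_utf8_bom` is a hypothetical helper name, while `codecs.BOM_UTF8` is the standard library constant for these three bytes.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


raw_header = b'\xef\xbb\xbfCampaign'       # header bytes as they appear in the file
clean_header = strip_utf8_bom(raw_header)  # the BOM is gone, the name is intact
untouched = strip_utf8_bom(b'Campaign')    # input without a BOM is returned as-is
```

Only a BOM at the very start of the data is removed, so it is enough to apply this to the first chunk or first header read from the file.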

Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different when you type these characters into a source file or the interpreter: the three characters are re-encoded as UTF-8, giving '\xc3\xaf\xc2\xbb\xc2\xbf'. See the Python 2 Unicode HOWTO.
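The round trip can be reproduced in a few lines. This is only an illustration of how the two byte sequences in the question relate; the variable names are made up for the example.

```python
bom_bytes = b'\xef\xbb\xbf'               # what is actually in the file

# Wrong codec: each byte becomes one Windows-1252 character.
mojibake = bom_bytes.decode('cp1252')

# Typing those three characters into a UTF-8 terminal or source file
# re-encodes them, two bytes each.
reencoded = mojibake.encode('utf-8')

# Right codec: the three bytes are a single character, U+FEFF.
correct = bom_bytes.decode('utf-8')
```

Here `reencoded` is the six-byte sequence from the KeyError, while `bom_bytes` is the three-byte sequence that is really stored as the dictionary key.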

answered Jul 07 '15 at 23:34

roeland

2

You could use the `'utf-8-sig'` encoding to deal with the BOM automatically. – jfs Jul 08 '15 at 12:07
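A small sketch of what that comment suggests, using only the built-in codecs; the sample bytes are taken from the question.

```python
raw = b'\xef\xbb\xbfCampaign'

with_sig = raw.decode('utf-8-sig')            # a leading BOM is dropped
plain = raw.decode('utf-8')                   # the BOM survives as U+FEFF
no_bom = b'Campaign'.decode('utf-8-sig')      # input without a BOM also works
```

Because `'utf-8-sig'` accepts input with or without a BOM, it can be used for files whose origin is unknown.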

jfs · Accepted Answer · 2015-07-08T12:03:56.300

3

In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
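A minimal sketch of that decode-early, encode-late pattern with `io.open()`. The file names and sample data are hypothetical, and the files are assumed to be UTF-8.

```python
import io
import os
import tempfile

# Hypothetical sample data: the UTF-8 bytes for "Café,2" on disk.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'in.txt')
dst_path = os.path.join(tmp_dir, 'out.txt')
with open(src_path, 'wb') as f:
    f.write(b'Caf\xc3\xa9,2\n')

# Input boundary: io.open decodes the bytes to Unicode text while reading.
with io.open(src_path, encoding='utf-8') as f:
    text = f.read()

# Inside the program, only Unicode text is handled.
upper_text = text.upper()

# Output boundary: io.open encodes the text back to bytes while writing.
with io.open(dst_path, 'w', encoding='utf-8') as f:
    f.write(upper_text)

with open(dst_path, 'rb') as f:
    written_bytes = f.read()
```

The code in the middle never sees a bytestring, so operations such as `upper()` act on whole characters rather than on individual bytes.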

Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the examples section of the csv documentation. You could write an analogous wrapper for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
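One possible shape for such a DictReader analog is sketched below. `unicode_dict_rows` is a hypothetical name, not something from the csv documentation, and the sketch assumes every row has as many cells as the header. On Python 2 it follows the "pass UTF-8 encoded bytestrings to csv" route; the Python 3 branch is an addition so the same code runs there too.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def unicode_dict_rows(path, encoding='utf-8-sig'):
    """Yield each CSV row as a dict whose keys and values are Unicode text.

    Assumes every row has exactly as many cells as the header.
    """
    # io.open decodes on input; 'utf-8-sig' also drops a leading BOM.
    with io.open(path, encoding=encoding, newline='') as f:
        if PY2:
            # Python 2 csv wants bytestrings: feed it UTF-8 encoded lines,
            # then decode every key and cell back to Unicode.
            reader = csv.DictReader(line.encode('utf-8') for line in f)
            for row in reader:
                yield dict((k.decode('utf-8'), v.decode('utf-8'))
                           for k, v in row.items())
        else:
            # Python 3 csv works on text directly.
            for row in csv.DictReader(f):
                yield dict(row)


# Hypothetical sample file: a BOM, a header, and one row containing "Café".
sample_path = os.path.join(tempfile.mkdtemp(), 'spends.csv')
with open(sample_path, 'wb') as f:
    f.write(b'\xef\xbb\xbfCampaign,Spend\r\nCaf\xc3\xa9,2\r\n')

rows = list(unicode_dict_rows(sample_path))
campaign = rows[0][u'Campaign']   # plain key, no BOM prefix needed
```

With the conversion confined to one function, the rest of the program can look up `u'Campaign'` without knowing whether the file started with a BOM.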

edited Jul 08 '15 at 12:03

answered Jul 07 '15 at 21:32

jfs

1

I presume your first `cvs` was intended to read `csv`. :) I strongly encourage folks to take your suggested route of implicit codecs wherever and whenever possible, such as via `io.open()`. It really makes their lives a whole lot easier. I sometimes get so used to this that I forget there's any other way, until I bump up against a library that only handles bytes data instead of character data. Suddenly you drop out of the comfortable constructed world of a high-level language into the rabbit hole of device drivers and bit twiddling and terrors even worse than those. – tchrist Jul 08 '15 at 11:14
1

`csv` is such an example. In Python 2.x, `csv` reads *bytes* rather than *text*. The post linked by Sebastian gives a good way to at least deal with the conversion in a single place. – roeland Jul 09 '15 at 23:19
