Python string encodings and ==

Question

I am having some trouble with strings in Python not being == when I think they should be, and I believe it has something to do with the way they are encoded. Basically, I am parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).

I'm using the ZipFile class from Python's zipfile module to open certain files in the zip archives and then comparing the text there to some known values. Here's an example file:

agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:

zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")

line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....

This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!

Note that if code is to be executed on only the first (or last) iteration, it's more performant and clearer to move that code before (or after) the loop. Also, you can use `position = lineparts.index('agency_id')` to find the position of the desired field in a line and `for agencyline in agency_file` to loop over remaining lines in the file. Once your program runs correctly, you may want to post it on [codereview.SE](http://codereview.stackexchange.com/) for more feedback. — outis, Jun 02 '12 at 17:54
UTF-8 files should not have BOMs: they are “neither required nor recommended” by the Unicode Standard. This smells like a Windows bug. — tchrist, Jun 03 '12 at 04:50

Kjir · Accepted Answer · 2012-06-03T00:02:08.497

5

That is a byte order mark (BOM), which identifies the encoding of the file and, in the case of UTF-16 and UTF-32, also indicates the endianness of the file. You can either interpret it or check for it and remove it from your string. To remove it, you could do this:

import codecs

unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
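For readers on Python 3, where `unicode()` no longer exists, the same idea can be sketched as below. This is only an illustration, not the answerer's code: the helper name `strip_utf8_bom` is made up, and it assumes the field arrives as UTF-8 encoded bytes.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes and drop one leading BOM (U+FEFF), if present."""
    text = raw.decode("utf-8")
    bom = codecs.BOM_UTF8.decode("utf-8")  # the one-character string "\ufeff"
    if text.startswith(bom):
        text = text[len(bom):]
    return text


# Hypothetical header field, matching the repr() shown in the question.
header_field = b"\xef\xbb\xbfagency_id"
print(strip_utf8_bom(header_field) == "agency_id")  # True
```

Unlike `lstrip`, this removes at most one mark and only at the very start of the text.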

edited Jun 03 '12 at 00:02

answered Jun 02 '12 at 17:45

Kjir


Tested with python 3.2.3 and 2.7.3 – Kjir Jun 02 '12 at 18:19
I get the following error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) – jmetz Jun 02 '12 at 22:09

mata · Answer 2 · 2012-06-02T17:52:08.737

3

Your input file seems to be UTF-8 encoded and to start with a 'ZERO WIDTH NO-BREAK SPACE' character,

import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'

which is used as a BOM (or, more accurately, as a signature that identifies the file as UTF-8, since byte order is not really an issue with UTF-8; it's commonly called a BOM anyway).
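The snippet above is Python 2. A rough Python 3 version of the same check is sketched here; the variable name `bom_char` is just for illustration.

```python
import codecs
import unicodedata

# codecs.BOM_UTF8 holds the three bytes seen in the question's repr() output.
bom_char = codecs.BOM_UTF8.decode("utf-8")

print(repr(codecs.BOM_UTF8))       # b'\xef\xbb\xbf'
print(hex(ord(bom_char)))          # 0xfeff
print(unicodedata.name(bom_char))  # ZERO WIDTH NO-BREAK SPACE
```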

edited Jun 02 '12 at 17:52

answered Jun 02 '12 at 17:45

mata


The BOM has the same hex code as the zero-width nbsp; this is probably a BOM. – beerbajay Jun 02 '12 at 17:47

score 0 · Answer 3 · answered Jun 02 '12 at 17:48

Simple: some of the files in your zip archives have a Unicode BOM (byte order mark) at the beginning. This is used to indicate the byte order of multi-byte encodings. This means you're reading encoded Unicode text (UTF-8 here, judging by the bytes \xef\xbb\xbf) as a bytestring. The easiest thing to do would be to check for it at the start of the string and remove it.
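One way to check which BOM, if any, starts a byte string is to compare against the constants in the standard `codecs` module. The sketch below is an assumption-laden illustration: `detect_bom` and `_BOMS` are hypothetical names, and a real feed may of course have no BOM at all.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(raw: bytes):
    """Return (encoding_name, remaining_bytes); the name is None without a BOM."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, raw[len(bom):]
    return None, raw


print(detect_bom(b"\xef\xbb\xbfagency_id"))  # ('utf-8', b'agency_id')
```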

score 0 · Answer 4 · answered Jun 02 '12 at 17:53

What you've got is a file that may occasionally have a Unicode Byte Order mark at the front of the file. Sometimes this is introduced by editors to indicate encoding.

Here are some details: http://en.wikipedia.org/wiki/Byte_order_mark

Bottom line is that you could look for the \xef\xbb\xbf string, which is the marker for UTF-8 encoded data, and just strip it. The other choice is to open it with the codecs package

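with codecs.open('input', 'r', 'utf-8-sig') as file:
    header = file.readline()  # 'utf-8-sig' drops a leading BOM; plain 'utf-8' keeps it as u'\ufeff'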

or in your case

zipped_feed = ZipFile(feed_name, "r")
# wrap zipped_feed.open(...) in a decoding reader; "utf-8-sig" drops a leading BOM
agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))
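On Python 3, a similar effect can be had by wrapping the zip member in `io.TextIOWrapper` with the `utf-8-sig` codec and letting the `csv` module split the fields. The following is a minimal sketch, not the asker's program: the archive is built in memory purely so the example runs on its own, and the sample rows are trimmed from the question.

```python
import csv
import io
import zipfile

# Build a small in-memory archive whose agency.txt starts with a UTF-8 BOM.
sample = (
    "\ufeffagency_id,agency_name\r\n"
    "ARLC,Arlington Transit\r\n"
).encode("utf-8")
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("agency.txt", sample)

with zipfile.ZipFile(buffer, "r") as zipped_feed:
    with zipped_feed.open("agency.txt", "r") as raw_file:
        # utf-8-sig skips a leading BOM if present, else behaves like utf-8.
        text_file = io.TextIOWrapper(raw_file, encoding="utf-8-sig", newline="")
        reader = csv.reader(text_file)
        header = next(reader)
        position = header.index("agency_id")
        agency_ids = [row[position] for row in reader]

print(position, agency_ids)  # 0 ['ARLC']
```

Because `utf-8-sig` also accepts files without a BOM, the same code should work for both kinds of archive described in the question.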
