1

I am having some trouble with strings in Python not comparing equal with == when I think they should, and I believe it has something to do with the way they are encoded. Basically, I am parsing some comma-separated values that are stored in zip archives (GTFS feeds specifically, for those who are curious).

I'm using the ZipFile class from Python's zipfile module to open certain files in the zip archives and then comparing the text there to some known values. Here's an example file:

agency_id,agency_name,agency_url,agency_phone,agency_timezone,agency_lang
ARLC,Arlington Transit,http://www.arlingtontransit.com,703-228-7433,America/New_York,en

The code I'm using is trying to identify the position of the string "agency_id" in the first line of the text so that I can use the corresponding value in any subsequent lines. Here's a snippet of the code:

from zipfile import ZipFile

zipped_feed = ZipFile(feed_name, "r")
agency_file = zipped_feed.open("agency.txt", "r")

line_num = 0
agencyline = agency_file.readline()
while agencyline:
    if line_num == 0:
        # this is the header, all we care about is the agency_id
        lineparts = agencyline.split(",")
        position = -1
        counter = 0
        for part in lineparts:
            part = part.strip()
            if part == "agency_id":
                position = counter
            counter += 1
        line_num += 1
        agencyline = agency_file.readline()
    else:
        .....

This code works for some zip archives, but not for others. I did some research and tried printing repr(part), and I got '\xef\xbb\xbfagency_id' instead of 'agency_id'. Does anyone know what's going on here and how I can fix it? Thanks for all the help!

beerbajay
jmetz
  • Sorry! I was in the process of editing when you did it! – jmetz Jun 02 '12 at 17:42
  • Note that if code is to be executed on only the first (or last) iteration, it's more performant and clearer to move that code before (or after) the loop. Also, you can use `position = lineparts.index('agency_id')` to find the position of the desired field in a line and `for agencyline in agency_file` to loop over remaining lines in the file. Once your program runs correctly, you may want to post it on [codereview.SE](http://codereview.stackexchange.com/) for more feedback. – outis Jun 02 '12 at 17:54
  • UTF-8 files should not have BOMs: they are “neither required nor recommended” by the Unicode Standard. This smells like a Windows bug. – tchrist Jun 03 '12 at 04:50

4 Answers

5

That is a Byte Order Mark (BOM), which indicates the encoding of the file and, in the case of UTF-16 and UTF-32, also its endianness. You can either interpret it, or check for it and remove it from your string. To remove it you could do this:

import codecs

unicode(part, "utf8").lstrip(codecs.BOM_UTF8.decode("utf8", "strict"))
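
The line above is Python 2. As a minimal sketch of the same idea in Python 3, where the zip member yields bytes, one could check for the BOM before decoding. The helper name `strip_utf8_bom` is hypothetical and only used for illustration:

```python
import codecs


def strip_utf8_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


# Header bytes as they appear in the question's repr() output.
header = b"\xef\xbb\xbfagency_id,agency_name"
first_field = strip_utf8_bom(header).split(",")[0]
print(first_field == "agency_id")
```

Removing the prefix by length only touches the very first mark, whereas `lstrip` would also remove any further U+FEFF characters that follow it.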
Kjir
3

Your input file seems to be UTF-8 and to start with a 'ZERO WIDTH NO-BREAK SPACE' character,

import unicodedata
unicodedata.name('\xef\xbb\xbf'.decode('utf8'))
# gives: 'ZERO WIDTH NO-BREAK SPACE'

which is used as a BOM (or, more accurately, as a signature identifying the file as UTF-8; byte order doesn't really apply to UTF-8, but it's commonly called a BOM anyway).
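
For Python 3 readers, here is a small sketch of the same check. It also shows the difference between the plain "utf-8" codec, which keeps the mark as U+FEFF, and the standard "utf-8-sig" codec, which consumes it. The variable names are illustrative only:

```python
import unicodedata

raw = b"\xef\xbb\xbfagency_id"

with_bom = raw.decode("utf-8")         # the BOM survives as U+FEFF
without_bom = raw.decode("utf-8-sig")  # the codec drops a leading BOM

bom_name = unicodedata.name(with_bom[0])
print(bom_name)
print(without_bom == "agency_id")
```

Since "utf-8-sig" also decodes input that has no BOM, it can be applied to every feed regardless of how the file was produced.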

mata
0

Simple: some of the files in your zip archives start with the Unicode BOM (Byte Order Mark). In multi-byte encodings such as UTF-16 it indicates the byte order; the bytes you are seeing, \xef\xbb\xbf, are the UTF-8 form of the mark. This means you're reading encoded Unicode text (UTF-8 here) as a bytestring. The easiest thing to do would be to check for it at the start of the string and remove it.
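
A rough sketch of that check in Python 3, using the BOM constants from the standard `codecs` module. The function name `split_bom` and the returned encoding labels are assumptions made for this example:

```python
import codecs

# Longer marks come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(raw: bytes):
    """Return (encoding, payload) with a leading BOM removed.

    The encoding is None when no known BOM is found.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding, raw[len(bom):]
    return None, raw


encoding, payload = split_bom(b"\xef\xbb\xbfagency_id")
print(encoding, payload)
```

For the data in the question this reports UTF-8, so the payload can then be decoded with `payload.decode(encoding)`.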

Lukasa
0

What you've got is a file that may occasionally have a Unicode Byte Order mark at the front of the file. Sometimes this is introduced by editors to indicate encoding.

Here are some details - http://en.wikipedia.org/wiki/Byte_order_mark

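The bottom line is that you could look for the \xef\xbb\xbf byte string, which is the marker for UTF-8 encoded data, and just strip it. The other choice is to open the file with the codecs package, using the "utf-8-sig" codec so that the marker is skipped during decoding (the plain "utf-8" codec leaves it in the text as U+FEFF):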

with codecs.open('input', 'r', 'utf-8-sig') as file:

or in your case

zipped_feed = ZipFile(feed_name, "r")
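# wrap zipped_feed.open(...) in a stream reader for the BOM-aware "utf-8-sig" codec
# (the codecs.StreamReader base class cannot decode on its own)
agency_file = codecs.getreader("utf-8-sig")(zipped_feed.open("agency.txt", "r"))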
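
In Python 3, a comparable approach is to wrap the binary zip member in `io.TextIOWrapper` with the "utf-8-sig" encoding and let the `csv` module find the column by name. This is only a sketch under the assumption that the feed is UTF-8 with or without a BOM; the function name `read_agency_ids` is hypothetical:

```python
import csv
import io
import zipfile


def read_agency_ids(feed_path):
    """Return the agency_id value of every data row in agency.txt."""
    with zipfile.ZipFile(feed_path) as feed:
        with feed.open("agency.txt") as raw:
            # "utf-8-sig" skips a leading BOM and also accepts files without one.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            reader = csv.DictReader(text)
            return [row["agency_id"] for row in reader]
```

Called as `read_agency_ids(feed_name)`, this would return `["ARLC"]` for the example file in the question, and it avoids tracking the column position by hand.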
koblas