Is there an extra character on the first readline() of a file?

Question

The first line of a file I'm reading in seems to obey different rules than the rest of the lines.

Expected behavior: Each line is checked for a hash at the beginning, and if there isn't one, the line is passed on for processing.

Actual behavior: That's true except for the first line. The first line somehow gets through to a try/except in checkForMatch().

Hack: If I include a second readline to get past the first one, all subsequent lines work fine. If I handle the try/except correctly to report and skip the first line, all subsequent lines work fine.

rulesFile = open("example.tsv","r",encoding="utf-8")

# line = rulesFile.readline()
line = rulesFile.readline()
while line != "":
    line = line.lstrip()
    line = line.rstrip()
    if line != "" and line[0] != "#":
        checkForMatch(line, args)
    line = rulesFile.readline()

The first and second lines of the file both consist of a hash, a space, and ASCII text.

# First line
# Second line

I looked at some other answers and tried replacing

line[0] != "#"

with

not line.startswith("#")

It may be more pythonic, but the output remains identical.

Is there a secret initial character on the first line of a file, or some other subtle problem here?

score 3 · Accepted Answer · answered Mar 05 '15 at 00:52

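You forgot about the BOM (byte order mark). The file starts with the bytes EF BB BF, which the plain "utf-8" codec decodes as the character U+FEFF at the start of the first line, so line[0] is not "#". Since U+FEFF is not whitespace, lstrip() leaves it in place. The "utf-8-sig" codec skips the BOM on read: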

rulesFile = open("example.tsv", "r", encoding="utf-8-sig")
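To see the difference for yourself, here is a minimal sketch. It assumes a throwaway file written to a temporary directory with a UTF-8 BOM in front of the two comment lines from the question; the helper name first_line is made up for illustration.

```python
import codecs
import os
import tempfile

# Build a small test file that starts with a UTF-8 BOM (bytes EF BB BF),
# followed by the two comment lines from the question.
path = os.path.join(tempfile.mkdtemp(), "example.tsv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"# First line\n# Second line\n")


def first_line(encoding):
    """Hypothetical helper: return the first line decoded with `encoding`."""
    with open(path, "r", encoding=encoding) as f:
        return f.readline()


plain = first_line("utf-8")         # BOM survives as U+FEFF
stripped = first_line("utf-8-sig")  # BOM is skipped by the codec

print(repr(plain))     # '\ufeff# First line\n'
print(repr(stripped))  # '# First line\n'

# lstrip() does not help, because U+FEFF is not whitespace.
print(plain.lstrip().startswith("#"))     # False
print(stripped.lstrip().startswith("#"))  # True
```

Printing repr(line) like this is also a quick way to check any file for invisible leading characters. "utf-8-sig" reads files without a BOM just as well, so it is safe to use when you don't know which kind you have.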

Ignacio Vazquez-Abrams

Forget, not at all. Had never seen the encoding notation and trusted someone else's old script... Does -sig work for all bit orders or are there different encoding values I might have to try? – Bibliotango Mar 05 '15 at 00:57
UTF-8 only has a single byte order. – Ignacio Vazquez-Abrams Mar 05 '15 at 01:05
I'm dropping the link to more information (and arguments) about BOM and UTF-8, because I was trying to figure out why both utf-8 and utf-8-sig existed. If you're wondering that too, here you go. http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom – Bibliotango Mar 05 '15 at 22:58
