The first line of a file I'm reading in seems to obey different rules than the rest of the lines.

Expected behavior: Each line is checked for a leading hash, and only lines that don't start with one are passed on for processing.

Actual behavior: That holds for every line except the first. The first line, even though it begins with a hash, is passed to checkForMatch() and ends up in a try/except there.

Workaround: If I add a second readline() call to skip the first line, all subsequent lines work fine. The same is true if I make the try/except report and skip the first line.

rulesFile = open("example.tsv","r",encoding="utf-8")

# line = rulesFile.readline()
line = rulesFile.readline()
while line != "":
    line = line.lstrip()
    line = line.rstrip()
    if line != "" and line[0] != "#":
        checkForMatch(line, args)
    line = rulesFile.readline()

The first and second lines of the file both consist of a hash, a space, and ASCII text:

# First line
# Second line

I looked at some other answers and tried replacing

line[0] != "#"

with

not line.startswith("#")

It may be more Pythonic, but the output remains identical.

Is there a secret initial character on the first line of a file, or some other subtle problem here?

Bibliotango

1 Answer

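You forgot about the BOM. The file starts with a UTF-8 byte order mark, which the plain "utf-8" codec decodes as the character U+FEFF at the start of the first line, so line[0] is not "#". The "utf-8-sig" codec strips it on read: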

rulesFile = open("example.tsv", "r", encoding="utf-8-sig")
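
To see the difference for yourself, here is a minimal sketch. It assumes the file really does begin with a UTF-8 BOM, and it writes its own temporary stand-in for example.tsv; the helper names first_line and demo are made up for illustration and are not part of the original script.

```python
import codecs
import os
import tempfile


def first_line(path, encoding):
    """Return the stripped first line of the file, decoded with `encoding`."""
    with open(path, "r", encoding=encoding) as f:
        return f.readline().strip()


def demo():
    # Build a small stand-in file that starts with a UTF-8 BOM,
    # followed by the two comment lines from the question.
    fd, path = tempfile.mkstemp(suffix=".tsv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(codecs.BOM_UTF8 + "# First line\n# Second line\n".encode("utf-8"))
        plain = first_line(path, "utf-8")      # BOM survives as U+FEFF
        sig = first_line(path, "utf-8-sig")    # BOM is stripped on read
    finally:
        os.remove(path)
    return plain, sig


plain, sig = demo()
print(repr(plain))            # the leading U+FEFF is visible in the repr
print(plain.startswith("#"))  # False: the first character is U+FEFF
print(sig.startswith("#"))    # True
```

Note that strip() does not remove U+FEFF, because Python does not treat it as whitespace. That is why the lstrip()/rstrip() calls in the question did not help. Printing repr(line) for the first line is a quick way to spot an invisible character like this.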
Ignacio Vazquez-Abrams
  • Forget, not at all. Had never seen the encoding notation and trusted someone else's old script... Does -sig work for all bit orders or are there different encoding values I might have to try? – Bibliotango Mar 05 '15 at 00:57
  • UTF-8 only has a single byte order. – Ignacio Vazquez-Abrams Mar 05 '15 at 01:05
  • I'm dropping the link to more information (and arguments) about BOM and UTF-8, because I was trying to figure out why both utf-8 and utf-8-sig existed. If you're wondering that too, here you go. http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom – Bibliotango Mar 05 '15 at 22:58
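
For anyone wondering the same thing as the comments above, a small sketch of how the two codecs behave may help. It only uses the standard codecs module; the sample string is made up for illustration.

```python
import codecs

text = "# First line\n"

# The same content, once with a leading UTF-8 BOM and once without.
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

# "utf-8-sig" drops the BOM if it is present and is harmless if it is not,
# so it is safe for reading files of either kind.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")
print(decoded_with == decoded_without == text)

# When writing, "utf-8-sig" adds the BOM and plain "utf-8" does not.
encoded_sig = text.encode("utf-8-sig")
print(encoded_sig.startswith(codecs.BOM_UTF8))

# UTF-8 has one BOM byte sequence; UTF-16 has one per byte order.
print(codecs.BOM_UTF8)
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

So for reading, "utf-8-sig" is a reasonable default when you don't know whether a UTF-8 file carries a BOM. For writing, pick whichever the consuming program expects.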