with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()
    sourceFileConentsLength = len(sourceFileConents)

    i = 0
    while i < sourceFileConentsLength:
        print(str(i) + ' ' + sourceFileConents[i])
        i += 1

Please forgive the un-Pythonic index loop; this is only test code, and there are reasons to do it that way in the real code.

Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.

sourceFileConentsLength reports 13,690, but when I print the contents character by character, a few hundred characters at the end of the file are never printed.

What gives?

  • Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
  • Have I hit some maximum string length? If so, can I get around it?
  • Might it be line endings, if the file was edited in Windows & the script is run in Linux? (Sorry, I can't post the file; it's company confidential.)
  • What else?

[Update] I think that we can strike two of those ideas.

For maximum string length, see this question.

I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows & Linux, but the script will only run under Linux.


[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.

Mawg says reinstate Monica
  • 1
    Edited and run on same OS it works perfectly. Can you print `repr(sourceFileConents[i])` and tell if any of the contents have a `\r` character? Are there `100` lines in your source file? – Bhargav Rao Feb 24 '15 at 16:00
  • 1
    Have you considered writing `sourceFileContents` to a separate file and then inspecting the two with something like `diff`? If you do this, what do you see? – Two-Bit Alchemist Feb 24 '15 at 16:03
  • 2
    What encoding are you using? – Caramiriel Feb 24 '15 at 16:04
  • 3
    I believe your problem is that read() returns bytes and sourceFileConentsLength is number of bytes, not number of characters. You could convert it to unicode before finding length of it – user4600699 Feb 24 '15 at 16:06

2 Answers

If you read a file in text mode it will automatically convert line endings like \r\n to \n.

Try using

with open(sourceFileName, newline='') as sourceFile:

instead; this will turn off newline-translation (\r\n will be returned as \r\n).
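To see the effect on a small scale, here is a minimal sketch, assuming Python 3 and a made-up three-line sample with Windows line endings (the sample bytes and the temporary file are illustrative, not the asker's file). It reads the same file with and without newline translation and compares the lengths with the size on disk.

```python
import os
import tempfile

# Hypothetical sample: three lines with Windows (CRLF) line endings.
raw = b"one\r\ntwo\r\nthree\r\n"

# Write the sample to a temporary file so it can be read back both ways.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Default text mode: each '\r\n' is translated to a single '\n'.
    with open(path, 'rt') as f:
        translated = f.read()

    # newline='' switches the translation off, so '\r\n' is kept as-is.
    with open(path, 'rt', newline='') as f:
        untranslated = f.read()
finally:
    os.remove(path)

print(len(raw), len(translated), len(untranslated))
print(repr(translated))
print(repr(untranslated))
```

With translation on, the string is shorter than the byte count that `ls -l` would report, one character fewer per CRLF line; with `newline=''` the two match for plain ASCII content.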

Hugh Bothwell

If your file is encoded in something like UTF-8, you should decode it before counting the characters:

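# Python 2 only: read() returns a byte string here, so decode it to unicode before counting.
sourceFileContents_utf8 = open(sourceFileName, 'r+').read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))

i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    # Index the decoded string, the same one whose length bounds the loop.
    print('%s %s' % (str(i), sourceFileContents_unicode[i]))
    i += 1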
Tui Popenoe
  • What am I doing wrong? Python v3.2.3 raises an exception on `sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')`: `AttributeError: 'str' object has no attribute 'decode'` – Mawg says reinstate Monica Feb 25 '15 at 08:06
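
The AttributeError in the comment above is what Python 3 does when `.decode()` is called on a `str`: a file opened in text mode is already decoded, and only `bytes` objects have a `decode` method. A minimal sketch of the Python 3 equivalent follows, assuming UTF-8 input; the sample text and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample text with a few non-ASCII characters.
text = "na\u00efve caf\u00e9 \u2713\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(text.encode('utf-8'))
    path = tmp.name

try:
    # Binary mode returns bytes, which can be decoded explicitly.
    with open(path, 'rb') as f:
        raw_bytes = f.read()

    # Text mode with an explicit encoding returns an already-decoded str.
    with open(path, 'rt', encoding='utf-8', newline='') as f:
        as_text = f.read()
finally:
    os.remove(path)

decoded = raw_bytes.decode('utf-8')

# Byte count versus character count for the same content.
print(len(raw_bytes), len(decoded), len(as_text))
```

Here the byte count is larger than the character count because each non-ASCII character takes more than one byte in UTF-8, so a size taken from `ls -l` and a `len()` of the decoded string need not agree.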