File contents not as long as expected

Question

with open(sourceFileName, 'rt') as sourceFile:
    sourceFileConents = sourceFile.read()
    sourceFileConentsLength = len(sourceFileConents)

    i = 0
    while i < sourceFileConentsLength:
        print(str(i) + ' ' + sourceFileConents[i])
        i += 1

Please forgive the unPythonic index loop; this is only test code, and there are reasons to do it that way in the real code.

Anyhoo, the real code seemed to be ending the loop sooner than expected, so I knocked up the dummy above, which removes all of the logic of the real code.

sourceFileConentsLength is reported as 13,690, but when I print the contents out character by character, a few hundred characters at the end of the file are never printed.

What gives?

Should I be using something other than <fileHandle>.read() to get the file's entire contents into a single string?
Have I hit some maximum string length? If so, can I get around it?
Might it be line endings, if the file was edited in Windows and the script is run in Linux? (Sorry, I can't post the file; it's company confidential.)
What else?

[Update] I think that we can strike two of those ideas.

For maximum string length, see this question.

I did an ls -lAF to a temp directory. Only 6k+ chars, but the script handled it just fine. Should I be worrying about line endings? If so, what can I do about it? The source files tend to get edited under both Windows and Linux, but the script will only run under Linux.

[Update++] I changed the line endings on my input file to Linux in Eclipse, but still got the same result.

Edited and run on same OS it works perfectly. Can you print `repr(sourceFileConents[i])` and tell if any of the contents have a `\r` character? Are there `100` lines in your source file? — Bhargav Rao, Feb 24 '15 at 16:00
Have you considered writing `sourceFileContents` to a separate file and then inspecting the two with something like `diff`? If you do this, what do you see? — Two-Bit Alchemist, Feb 24 '15 at 16:03
I believe your problem is that read() returns bytes and sourceFileConentsLength is number of bytes, not number of characters. You could convert it to unicode before finding length of it — user4600699, Feb 24 '15 at 16:06
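The checks suggested in these comments can be combined into one small diagnostic. This is only a sketch: `summarize_file` is a hypothetical helper name, the file is assumed to be UTF-8, and the demo writes its own temporary file because the asker's file isn't available. It reads the raw bytes, decodes them, and reports how many bytes, characters, `\r` characters, and non-ASCII characters were found.

```python
import os
import tempfile


def summarize_file(path, encoding="utf-8"):
    """Hypothetical helper: compare byte count, character count and line endings."""
    with open(path, "rb") as f:          # binary mode: no decoding, no newline translation
        raw = f.read()
    text = raw.decode(encoding)          # assumption: the file really is UTF-8
    return {
        "bytes_on_disk": os.path.getsize(path),
        "bytes_read": len(raw),
        "characters": len(text),
        "carriage_returns": text.count("\r"),
        "non_ascii_characters": sum(1 for ch in text if ord(ch) > 127),
    }


# Demo on a throwaway file with Windows line endings and one accented letter.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("caf\u00e9\r\nline two\r\n".encode("utf-8"))
    demo_path = tmp.name

summary = summarize_file(demo_path)
os.remove(demo_path)
print(summary)
```

If `bytes_read` is larger than `characters`, or `carriage_returns` is non-zero, the length mismatch probably comes from encoding or line endings rather than from `read()` stopping early.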

score 2 · Answer 1 · answered Feb 24 '15 at 16:10

If you read a file in text mode it will automatically convert line endings like \r\n to \n.

Try using

with open(sourceFileName, newline='') as sourceFile:

instead; this will turn off newline-translation (\r\n will be returned as \r\n).
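To see the effect of `newline=''` on the reported length, here is a minimal Python 3 sketch. The temporary file and its contents are made up for illustration; only the `open()` arguments come from the answer.

```python
import os
import tempfile

# Write a small file with Windows (\r\n) line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\r\ntwo\r\n")
    sample_path = tmp.name

# Default text mode: \r\n is translated to \n while reading.
with open(sample_path, "rt") as f:
    translated = f.read()

# newline='' turns the translation off, so \r\n comes back unchanged.
with open(sample_path, "rt", newline="") as f:
    untranslated = f.read()

os.remove(sample_path)

print(len(translated), repr(translated))
print(len(untranslated), repr(untranslated))
```

The translated string is one character shorter per line, so a length taken from the file size on disk would not match `len()` of a string read in default text mode.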

Hugh Bothwell


A nice theory. Alas, it did not help :-( – Mawg says reinstate Monica Feb 25 '15 at 07:58

score 1 · Accepted Answer · answered Feb 24 '15 at 16:35

If your file is encoded in something like UTF-8, you should decode it before counting the characters:

# Python 2: in text mode, read() returns a byte string, so decode it explicitly.
with open(sourceFileName, 'r') as sourceFile:
    sourceFileContents_utf8 = sourceFile.read()
sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')
print(len(sourceFileContents_unicode))

i = 0
source_file_contents_length = len(sourceFileContents_unicode)
while i < source_file_contents_length:
    # Index the decoded string, not the raw bytes.
    print('%s %s' % (i, sourceFileContents_unicode[i]))
    i += 1
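In Python 3, a text-mode `read()` already returns a decoded `str`, which has no `.decode()` method. A rough Python 3 equivalent of the idea, assuming the file is UTF-8 and using a made-up temporary file for the demo, is to either read bytes and decode them, or pass `encoding=` to `open()`:

```python
import os
import tempfile

# Throwaway UTF-8 file containing two non-ASCII characters.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("na\u00efve caf\u00e9\n".encode("utf-8"))
    utf8_path = tmp.name

# Option 1: read raw bytes, then decode explicitly.
with open(utf8_path, "rb") as f:
    raw_bytes = f.read()
decoded = raw_bytes.decode("utf-8")

# Option 2: let open() do the decoding.
with open(utf8_path, "rt", encoding="utf-8") as f:
    text = f.read()

os.remove(utf8_path)

print(len(raw_bytes))   # number of bytes
print(len(text))        # number of characters

# Iterate by index and character over the decoded text.
for i, ch in enumerate(text):
    print(i, repr(ch))
```

The byte count exceeds the character count whenever the file contains multi-byte characters, so any loop bound should come from `len()` of the same object that is being indexed.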

Tui Popenoe


What am I doing wrong? With Python v3.2.3 I get an exception on the line `sourceFileContents_unicode = sourceFileContents_utf8.decode('utf8')`: `AttributeError: 'str' object has no attribute 'decode'` – Mawg says reinstate Monica Feb 25 '15 at 08:06
