15

I'm parsing a 20GB file and writing lines that meet a certain condition to another file; however, occasionally Python will read in 2 lines at once and concatenate them.

inputFileHandle = open(inputFileName, 'r')
outputFileHandle = open(outputFileName, 'w')
lstIgnoredRows = []

row = 0

for line in inputFileHandle:
    row = row + 1
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)

I've checked the line endings in the source file and they check out as line feeds (ASCII char 10). Pulling out the problem rows and parsing them in isolation works as expected. Am I hitting some Python limitation here? The position in the file of the first anomaly is around the 4GB mark.

James
  • 1,397
  • 3
  • 21
  • 30
  • Does the first anomaly always occur consistently at the same line count? Also, is `lstIgnoredRows` a list, and how big does it grow? I wonder what would happen if you just saved the lines you are interested in to the output file and didn't do anything with the lines you wanted to ignore. – Levon Apr 19 '12 at 02:41
  • 1
    Maybe you could try reading smaller chunks of the file at a time using a lazy method, similar to this question? Give it a shot http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – prrao Apr 19 '12 at 02:44
  • It happens at the same line count every time. lstIgnoredRows can grow to a few thousand items. – James Apr 19 '12 at 02:47
  • 1
    Side comment: adding strings to `lstIgnoredRows` may become problematic when you've got 20GB of data. Why not write the ignored row numbers to another file? – Hooked Apr 19 '12 at 02:48
  • @Hooked .. yes my thought too, I was worried about the size (and consumption of memory) with a potentially huge list – Levon Apr 19 '12 at 02:50

2 Answers

23

A quick Google search for "python reading files larger than 4gb" yielded many results. See here for such an example, and another one which takes over from the first.

It's a bug in Python.

Now, the explanation of the bug: it's not easy to reproduce because it depends both on the internal FILE buffer size and the number of chars passed to fread().

In the Microsoft CRT source code, in open.c, there is a block starting with this encouraging comment: "This is the hard part. We found a CR at end of buffer. We must peek ahead to see if next char is an LF." Oddly, there is an almost exact copy of this function in the Perl source code: http://perl5.git.perl.org/perl.git/blob/4342f4d6df6a7dfa22a470aa21e54a5622c009f3:/win32/win32.c#l3668

The problem is in the call to SetFilePointer(), used to step back one position after the lookahead; it will fail because it is unable to return the current position in a 32-bit DWORD. [The fix is easy; do you see it?] At this point, the function thinks that the next read() will return the LF, but it won't, because the file pointer was not moved back.

And the work-around:

But note that Python 3.x is not affected (raw files are always opened in binary mode and CRLF translation is done by Python); with 2.7, you may use io.open().
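A minimal sketch of that io.open() work-around (the file path and filter condition here are hypothetical stand-ins, since the original condition isn't shown): io.open() does its newline handling in Python itself, bypassing the buggy CRT text-mode layer on Python 2.7/Windows; on Python 3.x it is simply the built-in open, which is unaffected.

```python
import io
import os
import tempfile

# Hypothetical small sample file standing in for the 20GB input.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with io.open(path, "w") as f:
    f.write(u"keep 1\nskip 2\nkeep 3\n")

# io.open() performs CRLF translation in Python, not in the CRT,
# so the 32-bit SetFilePointer() path is never exercised.
kept = []
with io.open(path, "r") as inputFileHandle:
    for line in inputFileHandle:
        if line.startswith(u"keep"):   # stand-in for line_meets_condition
            kept.append(line)
```

With the sample file above, `kept` ends up holding lines 1 and 3 with their trailing newlines intact.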

Josh Smeaton
  • 47,939
  • 24
  • 129
  • 164
7

The 4GB mark is suspiciously near the maximum value that can be stored in a 32-bit register (2**32).
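As a quick sanity check on that arithmetic:

```python
# 2**32 is the largest value a 32-bit register can hold plus one;
# interpreted as a byte offset, that is exactly the 4 GiB boundary.
limit = 2 ** 32
print(limit)                 # 4294967296
print(limit // (1024 ** 3))  # 4 (GiB)
```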

The code you've posted looks fine by itself, so I would suspect a bug in your Python build.

FWIW, the snippet would be a little cleaner if it used enumerate:

inputFileHandle = open(inputFileName, 'r')

for row, line in enumerate(inputFileHandle, 1):  # start at 1 to match the original row count
    if line_meets_condition:
        outputFileHandle.write(line)
    else:
        lstIgnoredRows.append(row)
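A self-contained run of that pattern, using an in-memory file as a stand-in for the 20GB input (the "good"/"bad" condition is hypothetical, in place of line_meets_condition):

```python
import io

# In-memory stand-in for the real input file.
inputFileHandle = io.StringIO(u"good line\nbad line\ngood line\n")

lstIgnoredRows = []
kept = []
for row, line in enumerate(inputFileHandle, 1):  # start=1 keeps rows 1-based
    if line.startswith(u"good"):   # stand-in for line_meets_condition
        kept.append(line)
    else:
        lstIgnoredRows.append(row)

print(lstIgnoredRows)  # [2]
```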
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485