
I need to read a big datafile (~200GB), line by line, using a Python script.

I have tried the regular line-by-line methods, but those use a large amount of memory. I want to be able to read the file chunk by chunk.

Is there a better way to load a large file line by line, say:

a) by explicitly specifying the maximum number of lines held in memory at any one time? Or b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk loads completely without being truncated?

Angelo
  • Two quick suggestions: you may want to explain why you need to read such a huge file, in case your use case overlaps with an existing library, and you should post some example code showing what you have tried. – Spaceghost Aug 20 '14 at 18:42
  • This doesn't work for you? http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – WitYoBadSelf Aug 20 '14 at 18:44
  • Is the file text or binary? For such a huge file, it is probably binary and you should use an idiom to read and process in appropriately sized binary chunks. – dawg Aug 20 '14 at 18:47
  • Simply reading line by line like `for line in open('mybigfile'):` does not use much memory (assuming the lines themselves aren't enormous). Have you tried this method? – tdelaney Aug 20 '14 at 21:04

2 Answers


Instead of reading it all at once, try reading it line by line:

with open("myFile.txt") as f:
    for line in f:
        #Do stuff with your line

Or, if you want to read N lines at a time:

with open("myFile.txt") as myfile:
    head = [next(myfile) for x in xrange(N)]  # use range(N) on Python 3
    print head  # use print(head) on Python 3

If the file has fewer than N lines, `next()` raises `StopIteration` when it hits the end of the file; handling it is a simple try/except (although there are plenty of other ways):

try:
    head = [next(myfile) for x in xrange(N)]
except StopIteration:
    # Fewer than N lines were left in the file
    rest_of_lines = [line for line in myfile]

Or you can read in those remaining lines however you want.
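
As a rough sketch, `itertools.islice` can also grab up to N lines per pass; it simply returns fewer lines near the end of the file instead of raising `StopIteration` (N is a placeholder for whatever batch size you choose):

from itertools import islice

with open("myFile.txt") as myfile:
    while True:
        batch = list(islice(myfile, N))  # up to N lines; fewer (or none) at end of file
        if not batch:
            break
        # Do stuff with this batch of lines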

  • Your multi-line version raises `StopIteration` if you try to read past the end of the file – hlt Aug 20 '14 at 18:46

To iterate over the lines of a file, do not use `readlines`. Instead, iterate over the file object itself (you will also find versions using `xreadlines` - it is deprecated and simply returns the file object itself):

with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line

To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:

with open(the_path, 'r') as the_file:
    done = False
    while not done:
        the_lines = []
        for i in range(number_of_lines): # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True # Reached end of file
                break # No data left
        # Do stuff with the lines (the final batch may be short or empty)

Of course, you can also load the file in chunks of a specified byte count:

with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
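
If you also want option (b) from the question - fixed-size chunks that are only processed as complete lines - a rough sketch (reusing the same placeholder names) is to carry any trailing partial line over into the next chunk:

with open(the_path, 'r') as the_file:
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        data = the_file.read(the_byte_count)
        if not data:
            if leftover:
                # Do stuff with the final line (it had no trailing newline)
                pass
            break
        lines = (leftover + data).split('\n')
        leftover = lines.pop()  # last piece may be truncated; keep it for the next read
        for line in lines:
            # Do stuff with each complete line
            pass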
hlt
    You probably don't want `xreadlines` (even though it does what you want) as it is deprecated in modern versions of Python. – Alex Reynolds Aug 20 '14 at 18:43