
I need to read a big datafile (~200GB), line by line, using a Python script.

I have tried the regular line-by-line methods, but those use a large amount of memory. I want to be able to read the file chunk by chunk.

Is there a better way to load a large file line by line, say:

a) by explicitly specifying the maximum number of lines held in memory at any one time? Or b) by loading it in chunks of, say, 1024 bytes, provided the last line of each chunk loads completely without being truncated?

Angelo
  • Two quick suggestions: you may want to explain why you need to read such a huge file, in case your use case overlaps with an existing library, and you should post some example code showing what you have tried. – Spaceghost Aug 20 '14 at 18:42
  • This doesn't work for you? http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python – WitYoBadSelf Aug 20 '14 at 18:44
  • Is the file text or binary? For such a huge file, it is probably binary and you should use an idiom to read and process in appropriately sized binary chunks. – dawg Aug 20 '14 at 18:47
  • Simply reading line by line like `for line in open('mybigfile'):` does not use much memory (assuming the lines themselves aren't enormous). Have you tried this method? – tdelaney Aug 20 '14 at 21:04

2 Answers


Instead of reading it all at once, try reading it line by line:

with open("myFile.txt") as f:
    for line in f:
        #Do stuff with your line

Or, if you want to read N lines at a time:

with open("myFile.txt") as myfile:
    head = [next(myfile) for x in xrange(N)]  # use range(N) on Python 3
    print head  # use print(head) on Python 3

If the file has fewer than N lines, `next()` raises `StopIteration` when it hits the end of the file; handling it is a simple try/except (although there are plenty of other ways):

try:
    head = [next(myfile) for x in xrange(N)]
except StopIteration:
    # Fewer than N lines were left in the file
    rest_of_lines = [line for line in myfile]

Or you can read in those remaining lines however you want.
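
As a rough sketch, `itertools.islice` can also grab up to N lines per pass; it simply returns fewer lines near the end of the file instead of raising `StopIteration` (N is a placeholder for whatever batch size you choose):

from itertools import islice

with open("myFile.txt") as myfile:
    while True:
        batch = list(islice(myfile, N))  # up to N lines; fewer (or none) at end of file
        if not batch:
            break
        # Do stuff with this batch of lines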

  • Your multi-line version raises `StopIteration` if you try to read past the end of the file – hlt Aug 20 '14 at 18:46

To iterate over the lines of a file, do not use `readlines`. Instead, iterate over the file object itself (you will also find versions using `xreadlines` - it is deprecated and simply returns the file object itself):

with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line

To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:

with open(the_path, 'r') as the_file:
    done = False
    while not done:
        the_lines = []
        for i in range(number_of_lines): # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True # Reached end of file
                break # No data left
        # Do stuff with the lines (the final batch may be short or empty)

Of course, you can also load the file in chunks of a specified byte count:

with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
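
If you also want option (b) from the question - fixed-size chunks that are only processed as complete lines - a rough sketch (reusing the same placeholder names) is to carry any trailing partial line over into the next chunk:

with open(the_path, 'r') as the_file:
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        data = the_file.read(the_byte_count)
        if not data:
            if leftover:
                # Do stuff with the final line (it had no trailing newline)
                pass
            break
        lines = (leftover + data).split('\n')
        leftover = lines.pop()  # last piece may be truncated; keep it for the next read
        for line in lines:
            # Do stuff with each complete line
            pass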
hlt
    You probably don't want `xreadlines` (even though it does what you want) as it is deprecated in modern versions of Python. – Alex Reynolds Aug 20 '14 at 18:43