
So I have some fairly gigantic .gz files - we're talking 10 to 20 GB each when decompressed.

I need to loop through each line of them, so I'm using the standard:

import gzip
f = gzip.open(path+myFile, 'r')
for line in f.readlines():
    # (yadda yadda)
    pass
f.close()

However, both the open() and close() calls take AGES, using up 98% of the memory and CPU, so much so that the program exits and prints Killed to the terminal. Maybe it is loading the entire extracted file into memory?

I'm now using something like:

from subprocess import call
# extract to a plain text file first
f = open(path+'myfile.txt', 'w')
call(['gunzip', '-c', path+myfile], stdout=f)
f.close()
# then re-open the extracted file and loop through it
f = open(path+'myfile.txt', 'r')
for line in f:
    # do some looping through the file
    pass
f.close()
#then delete extracted file

This works. But is there a cleaner way?

LittleBobbyTables

2 Answers


I'm 99% sure that your problem is not in the gzip.open(), but in the readlines().

As the documentation explains:

f.readlines() returns a list containing all the lines of data in the file.

Obviously, that requires reading and decompressing the entire file, and building up an absolutely gigantic list.

Most likely, it's actually the malloc calls to allocate all that memory that are taking forever. And then, at the end of this scope (assuming you're using CPython), it has to GC that whole gigantic list, which will also take forever.

You almost never want to use readlines. Unless you're using a very old Python, just do this:

for line in f:

A file is an iterable full of lines, just like the list returned by readlines, except that it's not actually a list: it generates more lines on the fly by reading out of a buffer. So, at any given time, you'll only have one line and a couple of buffers on the order of 10 MB each, instead of a 25 GB list. And the reading and decompressing will be spread out over the lifetime of the loop, instead of being done all at once.
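
Put together, a minimal sketch of that pattern might look something like this (the filename and the 'rt' text mode are placeholders/assumptions, not taken from the question):

import gzip

# Stream the file line by line instead of building one giant list.
# 'rt' asks gzip.open for decoded text on Python 3; the question's plain 'r'
# gives bytes lines there, which also works for simple line-by-line processing.
with gzip.open('huge_file.gz', 'rt') as f:
    for line in f:
        # process one line at a time
        pass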

From a quick test, with a 3.5 GB gzip file, gzip.open() is effectively instant; for line in f: pass takes a few seconds; f.close() is effectively instant. But if I do for line in f.readlines(): pass, it takes… well, I'm not sure how long, because after about a minute my system went into swap-thrashing hell and I had to force-kill the interpreter to get it to respond to anything…
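
If you want to reproduce that kind of comparison yourself, here is a rough sketch (the path is a placeholder, and the absolute numbers will obviously depend on your machine):

import gzip
import time

path = 'big_test.gz'  # placeholder: any large gzip file you have lying around

start = time.time()
with gzip.open(path, 'rt') as f:
    for line in f:          # streaming: memory use stays flat
        pass
print('for line in f:', time.time() - start, 'seconds')

# Careful: on a multi-gigabyte file this second version may eat all your RAM,
# exactly as described above.
start = time.time()
with gzip.open(path, 'rt') as f:
    for line in f.readlines():   # builds the entire list first
        pass
print('readlines():', time.time() - start, 'seconds')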


Because this has come up a dozen more times since this answer, I wrote this blog post, which explains a bit more.

abarnert
  • @shihpeng `for line in f:` is a pythonic and correct answer. do you have further information otherwise? – FirefighterBlu3 Jun 18 '18 at 20:15
  • @FirefighterBlu3 You're referring to a 4-year-old comment, that refers to an answer that turned out to be misguided and was deleted by the answerer. Probably better to just flag it as no longer needed or ignore it than to reply to it. (If you can't read the deleted answer, shihpeng's problem was that he doesn't actually have text data, but binary data that happens to go for many megabytes without a `\x0a` byte anywhere. The answer there is to not read binary data as text…) – abarnert Jun 18 '18 at 20:24

Have a look at pandas, in particular its IO tools. They support gzip compression when reading files, and you can read files in chunks. Besides, pandas is very fast and memory efficient.

As I've never tried it, I don't know how well the compression and the chunked reading play together, but it might be worth a try.
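
As a rough, untested sketch (the filename, the separator and the chunk size are all assumptions, not something from the question):

import pandas as pd

# read_csv can decompress gzip on the fly and hand back DataFrames in chunks,
# so the whole file never has to fit in memory at once.
for chunk in pd.read_csv('myfile.gz', compression='gzip', sep='\t',
                         chunksize=100000):
    # process one DataFrame of up to 100,000 rows at a time
    pass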

Francesco Montesano
  • `gzip.open` has perfectly fine buffering, so you don't need to explicitly read in chunks; just use the normal file-like APIs to read it in the way that's most appropriate (`for line in f:`, or `for row in csv.reader(f)`, or even `readlines` with a size hint instead of no args). And it's also quite fast and memory efficient. As near as I can tell, the OP's code is only a memory hog because of `readlines`, and it's only slow because of that memory hogging. – abarnert Feb 01 '13 at 22:44
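
For reference, the `csv.reader(f)` variant mentioned in that comment might look something like this (the filename and the Python 3 text mode are assumptions):

import csv
import gzip

# gzip.open in 'rt' mode (Python 3) yields decoded text, which csv.reader expects;
# 'data.csv.gz' is just a placeholder name.
with gzip.open('data.csv.gz', 'rt', newline='') as f:
    for row in csv.reader(f):
        # process one row (a list of strings) at a time
        pass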