
I have large log files that are in compressed format, i.e. largefile.gz. These are commonly 4-7 GB each.

Here's the relevant part of the code:

for filename in os.listdir(path):
    if not filename.startswith("."):
        with open(b, 'a') as newfile, gzip.GzipFile(path + filename, 'rb') as oldfile:
            # BEGIN Reads each remaining line from the log into a list
            data = oldfile.readlines()
            for line in data:
                parts = line.split()

After this the code does some calculations (basically totaling up the bytes) and writes to a file a line like "total bytes for x criteria = y". All of this works fine on a small file, but on a large file it kills the system.
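To give an idea of the rest (not shown above because it seems unrelated to the problem), the totaling step is roughly shaped like this; the "criteria" field, the byte-count field, and the output format here are just placeholders, not my real logic:

def summarize(lines, outfile):
    # Placeholder logic: group by a made-up "criteria" field (the first one)
    # and total a made-up byte-count field (the last one). The real checks
    # are more involved, but this is the shape of it.
    totals = {}
    for line in lines:
        parts = line.decode("utf-8", "replace").split()
        if len(parts) >= 2 and parts[-1].isdigit():
            criteria = parts[0]
            totals[criteria] = totals.get(criteria, 0) + int(parts[-1])
    for criteria, total in sorted(totals.items()):
        outfile.write("total bytes for %s = %d\n" % (criteria, total))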

What I think my program is doing is reading the whole file and storing it in data. Correct me if I'm wrong, but I think it's trying to put the whole log into memory first.

Question: how can I read one line from the compressed file, process it, then move on to the next, without trying to store the whole thing in memory first? (Or is it really already doing that? I'm not sure, but based on looking at the Activity Monitor my guess is that it is trying to go all in memory.)

Thanks

chowpay
  • Generators are used to `yield` values. See this SO: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Adam Ranganathan Jan 31 '17 at 02:05

1 Answer


It wasn't storing the entire content in memory until you told it to. That is to say, instead of:

# BAD: stores your whole file's decompressed contents, split into lines, in data
data = oldfile.readlines()  
for line in data:
    parts = line.split()

...use:

# GOOD: Iterates a line at a time
for line in oldfile:
    parts = line.split()

...so you aren't storing the entire file in a variable. And obviously, don't store `parts` anywhere that persists past the one line either.

It's that easy.
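For completeness, here's a minimal sketch of your whole loop using that pattern; the input directory, output path, and the "last field is a byte count" criterion are placeholders, not taken from your code:

import gzip
import os

path = "/var/log/archive/"   # placeholder input directory of .gz logs
out_path = "totals.txt"      # placeholder output file

total_bytes = 0
for filename in os.listdir(path):
    if filename.startswith("."):
        continue
    # Iterating the GzipFile object directly decompresses lazily and
    # yields one line at a time, so memory use stays flat.
    with gzip.open(os.path.join(path, filename), "rb") as oldfile:
        for line in oldfile:
            parts = line.split()
            # placeholder criterion: treat the last field as a byte count
            if parts and parts[-1].isdigit():
                total_bytes += int(parts[-1])

with open(out_path, "a") as newfile:
    newfile.write("total bytes = %d\n" % total_bytes)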

Charles Duffy
    I think `readlines` is one of the worst methods Python created as far as making the "one obvious way to do it" the wrong way. People see it, and assume it's the correct way to read in lines, and never learn about file objects being iterators naturally. Most of the time, you want to just iterate the file object directly, and on the rare occasions you need it in another form, you could just use `list(myfile)` (or anything else that accepts an iterable and creates a data structure from it) without needing `.readlines()` at all. – ShadowRanger Jan 31 '17 at 02:11
  • @charles-duffy that seems to work! Is it possible to make it faster by loading say 4 gigs (or some arbitrary number/%) of the file into memory then processing off of that. Would it speed things up or make negligible difference? – chowpay Jan 31 '17 at 02:56
  • @chowpay, since the compression algorithm is already working in larger chunks than a line at a time, I'd expect it to be negligible. – Charles Duffy Jan 31 '17 at 04:35
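As a small illustration of ShadowRanger's point above that file objects are already iterators, here is a sketch of the two approaches side by side (using the largefile.gz from the question):

import gzip

# Memory-friendly: iterate the file object directly, one line at a time.
with gzip.open("largefile.gz", "rb") as f:
    for line in f:
        pass  # process line here

# Only if you genuinely need every line at once, build the list explicitly.
# list(f) works for any iterable and makes the cost obvious, so .readlines()
# is never required.
with gzip.open("largefile.gz", "rb") as f:
    all_lines = list(f)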