13

I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always get a MemoryError:

import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)

The traceback I got is:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    f_out.writelines(f_in)
MemoryError

I have read some discussion about this issue, but it is still not clear to me how to handle it. Can someone give me a more understandable explanation of how to deal with this problem?

shihpeng
  • What is the exact error with Mark's solution ? It cannot be on `f_out.writelines`, since you use `write` ... – Serge Ballesta Nov 20 '14 at 09:39
  • The error will be like this: `Traceback (most recent call last): File "test.py", line 8, in for line in f_in: MemoryError` – shihpeng Nov 20 '14 at 09:41

3 Answers

18

The problem here has nothing to do with gzip, and everything to do with reading line by line from a 10GB file with no newlines in it:

"As an additional note, the file I used to test the Python gzip functionality is generated by `fallocate -l 10G bigfile_file`."

That gives you a 10GB sparse file made entirely of 0 bytes. Meaning there are no newline bytes. Meaning the first line is 10GB long. Meaning it will take 10GB to read the first line. (Or possibly even 20 or 40GB, if you're using pre-3.3 Python and trying to read it as Unicode.)
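To make that concrete, a small sketch (the 10 MB size and the `sparse_demo.bin` file name are just safe stand-ins for the 10GB case):

import os

# Build a small all-zeros file, analogous to the fallocate one but only 10 MB.
size = 10 * 1024 * 1024
with open('sparse_demo.bin', 'wb') as f:
    f.seek(size - 1)
    f.write(b'\x00')

with open('sparse_demo.bin', 'rb') as f:
    line = f.readline()    # no newline bytes anywhere, so this reads the whole file
    print(len(line))       # 10485760 -- the "first line" is the entire file

os.remove('sparse_demo.bin')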

If you want to copy binary data, don't copy line by line. Whether it's a normal file, a GzipFile that's decompressing for you on the fly, a socket.makefile(), or anything else, you will have the same problem.

The solution is to copy chunk by chunk. Or just use copyfileobj, which does that for you automatically.

import gzip
import shutil

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

By default, copyfileobj uses a chunk size optimized to be often very good and never very bad. In this case, you might actually want a smaller size, or a larger one; it's hard to predict which a priori.* So, test it by using timeit with different bufsize arguments (say, powers of 4 from 1KB to 8MB) to copyfileobj. But the default 16KB will probably be good enough unless you're doing a lot of this.

* If the buffer size is too big, you may end up alternating long chunks of I/O and long chunks of processing. If it's too small, you may end up needing multiple reads to fill a single gzip block.
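A rough sketch of that timeit comparison, assuming the file names from the question and an arbitrary set of buffer sizes:

import gzip
import shutil
import timeit

def compress(bufsize):
    with open('test_large.csv', 'rb') as f_in:
        with gzip.open('test_out.csv.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out, bufsize)

# One run per size is plenty for a multi-GB input file.
for bufsize in (1 << 10, 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20, 1 << 22):
    elapsed = timeit.timeit(lambda: compress(bufsize), number=1)
    print('bufsize %8d bytes: %.1f s' % (bufsize, elapsed))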

abarnert
10

That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.

But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB.

I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.

#! /usr/bin/env python

import gzip
import sys

blocksize = 1 << 16     # 64 kB

def gzipfile(iname, oname, level):
    with open(iname, 'rb') as f_in:
        f_out = gzip.open(oname, 'wb', level)
        while True:
            block = f_in.read(blocksize)
            if block == '':
                break
            f_out.write(block)
        f_out.close()
    return


def main():
    if len(sys.argv) < 3:
        print "gzip compress in_file to out_file"
        print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
        exit(1)

    iname = sys.argv[1]
    oname = sys.argv[2]
    level = int(sys.argv[3]) if len(sys.argv) > 3 else 6

    gzipfile(iname, oname, level)


if __name__ == '__main__':  
    main()

I'm running Python 2.6.6, where gzip.open() doesn't support `with`.


As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
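For reference, a minimal Python 3 sketch of the same loop with that change (the file names here are placeholders; on Python 2.7 and 3.x, gzip.open also supports `with`):

import gzip

blocksize = 1 << 16     # 64 kB

with open('big_input.tar', 'rb') as f_in:
    with gzip.open('big_output.tar.gz', 'wb') as f_out:
        while True:
            block = f_in.read(blocksize)
            if not block:       # an empty bytes object means EOF
                break
            f_out.write(block)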

PM 2Ring
  • As an additional note, the file I used to test the Python gzip functionality is generated by `fallocate -l 10G bigfile_file`. Python cannot gzip such big files via the file iterable (it seems to have been a bug for a long time?). – shihpeng Nov 21 '14 at 03:43
  • 1
    @shihpeng: I'm not familiar with `fallocate`, so this is just a guess, but maybe Python's gzip doesn't like such files because they don't contain any actual data. I can't test it since I'm still using ext3 on this system, which doesn't support `fallocate`. However, my program works ok using a big file created using `truncate`, which creates sparse files. – PM 2Ring Nov 21 '14 at 07:41
  • "Even though the OS buffers disk I/O it's generally much faster to read and write larger blocks of data, eg 64 kB." `GzipFile` uses buffered I/O (either C stdio buffers in Python 2.x, or Python `io` buffers in 3.x). It only reads from disk when it tries to uncompress another zlib block and that block goes beyond the buffer. So it's already doing everything you're trying to do here. The only difference is that you're using a larger blocksize; if that actually helps, you can just open the file manually and construct a `GzipFile` from it instead of using `gzip.open` (see the sketch just after these comments). – abarnert Nov 21 '14 at 20:02
  • Also, it's not that `fallocate` files don't contain any actual data; as far as Python can tell, the file contains 10GB of data, and it's all 0's. So it's exactly what you suspected in the first place. – abarnert Nov 21 '14 at 20:18
  • @abarnert: Thanks for that info about the buffering. And, yeah, I temporarily forgot about the stuff I said re newlines when I was discussing `fallocate`. :) :oops: – PM 2Ring Nov 22 '14 at 05:01
  • 1
    In Python 3.x, instead of using the `if block == ''` use the length of the block to determine if the block is empty. This is due to the fact that the string is unicode and cannot be compared to the block. – AndrewBay May 11 '17 at 07:41
  • @AndrewBay Thanks for the heads-up. I've added some info to my answer. I won't bother re-writing this code for Python 3. I guess it's still a handy example, but the technique in Andrew Barnert's answer is superior. – PM 2Ring May 11 '17 at 11:56
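As a rough sketch of the variant abarnert suggests above, one could open the output file manually with a larger buffer and hand it to GzipFile via its fileobj parameter (the 1 MB sizes and the file names here are arbitrary choices for illustration):

import gzip
import shutil

# Open the output file ourselves with a 1 MB buffer, then wrap it in GzipFile
# via fileobj= instead of calling gzip.open().
with open('test_large.csv', 'rb') as f_in, \
        open('test_out.csv.gz', 'wb', buffering=1 << 20) as raw_out:
    with gzip.GzipFile(fileobj=raw_out, mode='wb') as f_out:
        shutil.copyfileobj(f_in, f_out, 1 << 20)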
3

It is weird to get a memory error even when reading a file line by line. I suppose it is because you have very little available memory and very large lines. You should then use binary reads:

import gzip

# adapt the size value: smaller values take more time, a very high value could cause memory errors
size = 8096

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        while True:
            data = f_in.read(size)
            if not data: break   # an empty read means end of file
            f_out.write(data)
Serge Ballesta
  • Yes, m3.large only has 2 vCPUs and 7GB of memory, which is very limited if there are other processes or servers running on the same instance. – shihpeng Nov 21 '14 at 08:42
  • This only copies the first 8KB. – abarnert Nov 21 '14 at 20:19
  • Now it will loop forever, writing empty strings forever after EOF. – abarnert Nov 24 '14 at 20:27
  • @abarnert: Fixed ... and tested this time :-) – Serge Ballesta Nov 25 '14 at 04:43
  • OK, now compare this to `copyfileobj(f_in, f_out, size)`, which wouldn't require 3 fixes to get right (because `copyfileobj` has already been written, tested, and optimized, and used in thousands of projects for the past few decades) and is easier to read… – abarnert Nov 25 '14 at 19:52
  • @abarnert +1 for you. I seldom used `shutil`. But now I've just read again the module doc and I will remember it :-) (Even if the 3 fixes were more caused by a lack of test than by the complexity of my script, I'm always more confident in Python Standard Library than in my own code ...) – Serge Ballesta Nov 25 '14 at 20:29
  • 1
    @SergeBallesta: You should see some of the 180-line monstrosities I've written and spent hours debugging only to realize I'd duplicated something that came with the stdlib for free. :) – abarnert Nov 25 '14 at 20:30