
I have big gzip-compressed files. I wrote a piece of code to split those files into smaller ones, and I can specify the number of lines per file. The thing is that I recently increased the number of lines per split to 16,000,000, and when I process bigger files the split won't happen. Sometimes a smaller file is successfully produced, sometimes one is produced but weighs only 40 B or 50 B, which is a failure. I tried to catch the exceptions raised in the gzip code, so my code looks like this:

def writeGzipFile(file_name, content):
    import gzip
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except:
                print "UNEXPECTED ERROR wb"

The thing is that when the content is too large, relative to the number of lines, I often get the "UNEXPECTED ERROR" message, so I have no idea which kind of error is thrown here.

I finally found that the number of lines was the problem; it appears Python's gzip fails to write that amount of data to one file at once. Lowering the number of lines per split to 4,000,000 works. However, I would like to split the content and write it sequentially to a file, to make sure that even a large amount of data gets written.

So I would like to know how to find out the maximum number of characters that can reliably be written to a file in one go using gzip, without any failure.


EDIT 1

So I caught all remaining exceptions (I did not know that it was possible to simply catch Exception, sorry):

def writeGzipFile(file_name, content, file_permission=None):
    import gzip, sys, traceback
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except Exception as err:
                print "EXCEPTION:", err.message
                print "TRACEBACK_1:", traceback.print_exc(file=sys.stdout)
            except:
                print "UNEXPECTED ERROR wb"

The error is about the int size. I never thought I would exceed the int size one day:

EXCEPTION: size does not fit in an int
TRACEBACK_1:Traceback (most recent call last):
  File "/home/anadin/dev/illumina-project-restructor_mass-splitting/illumina-project-restructor/tools/file_utils/file_compression.py", line 131, in writeGzipFile
    f.write(content)
  File "/usr/local/cluster/python2.7/lib/python2.7/gzip.py", line 230, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
None

OK, so the max size of an int being 2,147,483,647, my chunk of data is about 3,854,674,090 bytes according to my log. This chunk is a string to which I applied the __len__() function.
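As a quick sanity check (just a sketch; INT_MAX below is the 32-bit signed limit that zlib.crc32 apparently hits, and the length is the one from my log):

# Rough check, assuming zlib.crc32 is limited to a 32-bit signed C int
INT_MAX = 2 ** 31 - 1             # 2,147,483,647
chunk_length = 3854674090         # len(content) reported in my log
print chunk_length > INT_MAX      # True, hence the OverflowError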

So, as I planned to do, and as Antti Haapala suggested, I am going to read smaller chunks at a time and write them sequentially to smaller files; a rough sketch of that plan is below.
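Something along these lines (a minimal sketch of the plan; the name split_gzip_by_lines and its parameters are made up for illustration, not taken from my actual code):

import gzip

def split_gzip_by_lines(input_path, output_prefix, lines_per_file=4000000):
    """Stream a big gzip file into several smaller gzip files, line by line."""
    part = 0
    out = None
    with gzip.open(input_path, 'rb') as src:
        for i, line in enumerate(src):
            if i % lines_per_file == 0:
                # roll over to a new output file every lines_per_file lines
                if out is not None:
                    out.close()
                out = gzip.open('%s.part%03d.gz' % (output_prefix, part), 'wb')
                part += 1
            out.write(line)   # one line at a time, never one huge buffer
    if out is not None:
        out.close()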

  • And when you've fixed this - *don't use unnamed Exception blocks* - because they cause precisely this sort of problem by hiding the failure mechanism. (Unless you *really* don't care if an operation fails). – SiHa Mar 15 '16 at 12:49
  • If it works when you lower the amount of data you're processing, then I'm pretty sure that the problem is that you're holding *all* of the data in a single string: `content`. This will probably use up all your memory. I don't see the rest of your code, but generally you want to read a small chunk of data (usually a line), do with it whatever you want, and then feed it to gzip which will then write that little chunk to disk. You then repeat this 16 million (or however often) times, and at no point will the occupied memory be more than a single line... – Martin Tournoij Mar 15 '16 at 12:51
  • FWIW, [this answer](http://stackoverflow.com/a/27037303/4014959) I wrote a couple of years ago shows how to gzip a file block by block. – PM 2Ring Mar 15 '16 at 12:56

1 Answer


In any case, I suspect the reason is some kind of out-of-memory error. It is quite unclear to me why you would not write this data a smaller amount at a time; here, using the `chunks` method from this answer:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

...
with gzip.open(file_name, 'wb') as f:
    for chunk in chunks(content, 65536):
        f.write(chunk)

That is, you do it like you'd eat an elephant: one bite at a time.
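For what it's worth, the same bite-at-a-time idea works when the data comes straight from a file instead of a string already built in memory (a sketch only; source_path is a placeholder, not something from the question):

import gzip

# source_path is a hypothetical path to the uncompressed input data
with open(source_path, 'rb') as src, gzip.open(file_name, 'wb') as f:
    while True:
        block = src.read(65536)   # read and compress 64 KiB at a time
        if not block:
            break
        f.write(block)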

  • Thank you, my problem is a bit more complex, but the principle really is to write smaller chunks at a time. By the way, why are you limiting the chunk to 65536? Here I am limited by the int size, as it appears. Can I replace your size with the int range? Or anything more efficient, what is your take? – kaligne Mar 15 '16 at 13:16
  • I still wouldn't try my luck! You'd run out of memory sooner. This slicing actually *copies* that part of the string. – Antti Haapala -- Слава Україні Mar 15 '16 at 13:20
  • @kaligne: 64k is a good chunk size. You *might* get slightly faster performance with a larger block size, depending on your hardware, but IME the improvement is very marginal, and I've conducted tests on a range of drives, both mechanical and solid state. Remember, you're **not** interfacing directly with the hardware, you're going through the OS's driver software which uses efficient disk buffering, and drives have their own on-board caches as well. – PM 2Ring Mar 15 '16 at 13:28
  • You could use, say, 1 megabyte perhaps... or 16... but not int range. (And it could be 64-bit too!) – Antti Haapala -- Слава Україні Mar 15 '16 at 13:31
  • @Antti Haapala: heh, good point... And above all you just taught me about the usefulness of xrange! – kaligne Mar 15 '16 at 13:41
  • gzip has its internal buffers; only compressed output is written as it is ready; and gzip will consider the data as a bytestream, not blocks. – Antti Haapala -- Слава Україні Mar 15 '16 at 13:47