I have big gzip-compressed files.
I have written a piece of code to split those files into smaller ones, and I can specify
the number of lines per file. The thing is, I recently increased the number of lines per split
to 16,000,000, and when I process bigger files the split won't happen. Sometimes a smaller file
is successfully produced, sometimes one is produced but weighs only 40B or 50B, which is
a failure. I tried to catch an exception for that by looking at those raised in the gzip
code. So my code looks like this:
def writeGzipFile(file_name, content):
    import gzip
    with gzip.open(file_name, 'wb') as f:
        if content != '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except:
                print "UNEXPECTED ERROR wb"
The thing is, when the content is too large (that is, when the number of lines per split is high), I often get the "UNEXPECTED ERROR" message, so I have no idea what kind of error is thrown here.
I finally found that the number of lines was the problem: it appears python's gzip
fails at writing such an amount of data to one file in a single call. Lowering the number of lines per split to 4,000,000 works. However, I would like to split the content and write it to the file sequentially, to make sure that even large content gets written.
So I would like to know how to find out the maximum number of characters that can reliably be written in one go to a file using gzip, without any failure.
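Something like the following rough sketch is what I have in mind (the helper name and the chunk size are my own, untested):

import gzip

def writeGzipFileChunked(file_name, content, chunk_size=100 * 1024 * 1024):
    # Sketch: write `content` in chunk_size slices instead of one huge
    # f.write() call, so each individual write stays small.
    with gzip.open(file_name, 'wb') as f:
        for start in xrange(0, len(content), chunk_size):
            f.write(content[start:start + chunk_size])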
EDIT 1
So I caught all remaining exceptions (I did not know that it was possible to simply catch Exception,
sorry):
def writeGzipFile(file_name, content, file_permission=None):
    import gzip, sys, traceback
    with gzip.open(file_name, 'wb') as f:
        if content != '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except Exception as err:
                print "EXCEPTION:", err.message
                print "TRACEBACK_1:", traceback.print_exc(file=sys.stdout)
            except:
                print "UNEXPECTED ERROR wb"
The error is about int size. I never thought I would exceed an int's size one day:
EXCEPTION: size does not fit in an int
TRACEBACK_1:Traceback (most recent call last):
File "/home/anadin/dev/illumina-project-restructor_mass-splitting/illumina-project-restructor/tools/file_utils/file_compression.py", line 131, in writeGzipFile
f.write(content)
File "/usr/local/cluster/python2.7/lib/python2.7/gzip.py", line 230, in write
self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
None
OK, so the max size of an int being 2,147,483,647, my chunk of data is about 3,854,674,090 characters according to my log. This chunk is a string whose length I measured with len().
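So any single f.write() has to hand zlib.crc32 less than 2,147,483,647 bytes. As far as I understand, crc32 can be fed incrementally and yields the same checksum, which is why writing in chunks sidesteps the limit. A small sketch of that idea (the chunk size is arbitrary):

import zlib

_INT_MAX = 2 ** 31 - 1  # 2,147,483,647, the C int limit the traceback shows

def crc32_in_chunks(data, chunk_size=64 * 1024 * 1024):
    # Feeding zlib.crc32 slice by slice gives the same CRC as one big call,
    # while each slice stays far below _INT_MAX bytes.
    crc = 0
    for start in xrange(0, len(data), chunk_size):
        crc = zlib.crc32(data[start:start + chunk_size], crc)
    return crc & 0xffffffff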
So, as I planned to do, and as Antti Haapala suggested, I am going to read smaller chunks at a time and write them sequentially to smaller files.
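For the record, the splitting loop I am going for looks roughly like this (the function name, the file-naming scheme, and the 4,000,000-line default are mine, just for illustration):

import gzip

def split_gzip_file(input_name, output_prefix, lines_per_file=4000000):
    # Read the big gzip file line by line and write every lines_per_file
    # lines to a fresh, smaller gzip file, so no single write is ever huge.
    with gzip.open(input_name, 'rb') as source:
        part, count, out = 0, 0, None
        for line in source:
            if out is None:
                out = gzip.open('%s.%04d.gz' % (output_prefix, part), 'wb')
            out.write(line)
            count += 1
            if count == lines_per_file:
                out.close()
                out, count, part = None, 0, part + 1
        if out is not None:
            out.close()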