
I need my Python script to operate on gzipped files that may still be being written to. Because they have not been properly closed yet, such operations sometimes result in a CRC error at the end.

How can I suppress these errors and simply process everything up to the incomplete ending?

My code is:

if usegzip:
    opener = gzip.open
else:
    opener = open

...
for line in opener(input_filename,'r'):
    .... process line ....

The exception I get when a still-opened file is encountered is:

    for line in opener(input_filename,'r'):
  File "/opt/lib/python2.7/gzip.py", line 464, in readline
    c = self.read(readsize)
  File "/opt/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/opt/lib/python2.7/gzip.py", line 315, in _read
    self._read_eof()
  File "/opt/lib/python2.7/gzip.py", line 354, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x7248907 != 0x45e82dc4L

Can I somehow suppress it without reimplementing the `gzip` module?

Mikhail T.
  • Take a look [here](https://stackoverflow.com/questions/1732709/unzipping-part-of-a-gz-file-using-python) – sKwa Jan 09 '18 at 22:58
  • Thanks. Yes, it is the same error, but the [zlib module](https://docs.python.org/2/library/zlib.html) by itself does not provide an interface suitable for a drop-in replacement. There is no `zlib.open()` and friends... – Mikhail T. Jan 09 '18 at 23:19
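
For what it's worth, a line iterator can be layered on top of `zlib.decompressobj`, which decodes a stream chunk by chunk and does not complain when the input simply stops early. The following is only a rough sketch: the names `zlib_lines` and `chunk_size` are made up for illustration, it assumes a single-member gzip file with `\n` line endings, and it yields bytes rather than text. A trailer that is present but has a wrong checksum still raises `zlib.error`; the sketch stops there and may lose the data decoded from that last chunk.

```python
import zlib


def zlib_lines(path, chunk_size=64 * 1024):
    """Yield lines (as bytes) from a gzip file, tolerating a missing trailer.

    Hypothetical helper, not part of the zlib module.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    pending = b""
    with open(path, "rb") as raw:
        while True:
            chunk = raw.read(chunk_size)
            if not chunk:
                break  # no more bytes on disk, finished or not
            try:
                pending += decomp.decompress(chunk)
            except zlib.error:
                break  # damaged data: keep what was decoded earlier
            parts = pending.split(b"\n")
            pending = parts.pop()  # possibly incomplete last line
            for part in parts:
                yield part + b"\n"
            if decomp.eof:
                break  # end of the first gzip member; later members are ignored
    if pending:
        # Text after the last newline; for an unfinished file this may be
        # only part of a line.
        yield pending
```

It would be used as `for line in zlib_lines(input_filename):`. Whether the final fragment without a newline should be yielded or dropped depends on what the processing step can tolerate.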

1 Answer


Ok, the solution is to forgo the convenience of the for loop and iterate over the lines explicitly. The explicit iteration can then be wrapped in try/except to handle the errors. For example, here is a simple line counter for a gzipped file:

import gzip
import sys

f = sys.argv[-1]
count = 0
opener = gzip.open

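lines = opener(f)  # The file object is the iterator a for loop would normally use

while True:
    try:
        line = next(lines)
    except (IOError, EOFError, StopIteration):
        # IOError: CRC check failed on an unfinished file
        # EOFError: newer Python versions raise this for a truncated stream
        # StopIteration: normal end of file
        break
    count += 1

print(count)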

When the file is properly closed, the output of the above script is the same as that of gzcat | wc -l. But when the file is still being written to, the script can successfully read more lines than gzcat.
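
To get the for loop from the question back, the same try/except idea can be tucked into a generator. This is a minimal sketch under a few assumptions: `tolerant_lines` and `use_gzip` are names invented here, the file is opened in binary mode (so lines come back as bytes on Python 3), and `EOFError` is caught as well because newer Python versions raise it for a truncated stream.

```python
import gzip


def tolerant_lines(path, use_gzip=True):
    """Yield lines from path, stopping quietly at an unfinished gzip ending.

    Hypothetical helper wrapping the explicit-iteration approach.
    """
    opener = gzip.open if use_gzip else open
    f = opener(path, "rb")
    try:
        lines = iter(f)
        while True:
            try:
                line = next(lines)
            except StopIteration:
                break  # normal end of file
            except (IOError, EOFError):
                break  # bad or missing gzip trailer: keep what was read so far
            yield line
    finally:
        f.close()
```

The loop in the question then becomes `for line in tolerant_lines(input_filename, usegzip):`. Note that this swallows every `IOError` raised while reading, not only CRC failures, so a real disk error would also end the loop silently.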

Mikhail T.