How can I read from a corrupted tar.bz2 file in Python?

Question

I have a program which saves its output to a tar.bz2 file as it works. I have a Python script which processes that data.

I'd like to be able to work with the output if the first program is interrupted, or just run the Python script against it while the process is ongoing.

Of course, the final bzip2 block is unfinished, so it can't be read — it's effectively corrupted, although really it's just truncated. GNU tar will actually happily extract all that it can of the file up to that point — as will bzcat, for that matter. And bzip2recover can create repaired blocks, although it's really less useful in this case than bzcat.

But I'm trying to use Python's standard tarfile module. This fails with

  File "/usr/lib64/python2.7/tarfile.py", line 2110, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.7/tarfile.py", line 1792, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.7/tarfile.py", line 2361, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.7/tarfile.py", line 1803, in getmembers
    self._load()        # all members, we first have to
  File "/usr/lib64/python2.7/tarfile.py", line 2384, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.7/tarfile.py", line 2319, in next
    self.fileobj.seek(self.offset)
EOFError: compressed file ended before the logical end-of-stream was detected

when I try to use TarFile.extractfile on a file that I know is at the beginning. (tar -xf tarfile.tar.bz2 filename will extract it just fine.)

Is there anything clever I can do to ignore the invalid end to the file and work with what I've got?

The data set can get rather large, and is very, very compressible, so keeping it uncompressed is not desirable.

(I found the existing question Untar archive in Python with errors, but in that case, the user is trying to os.system the tar file.)

score 1 · Answer 1 · answered Feb 29 '12 at 01:44

There seem to be two possibilities in the tarfile documentation. Firstly, and most likely:

"If ignore_zeros is False, treat an empty block as the end of the archive. If it is True, skip empty (and invalid) blocks and try to get as many members as possible. This is only useful for reading concatenated or damaged archives."
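As a rough illustration of what that option does at the tar level, the sketch below concatenates two tiny in-memory archives and lists their members with and without ignore_zeros. The helpers make_tar and list_names and the member names are made up for the example. Whether this helps with a truncated bzip2 stream is a separate matter, since the EOFError in the traceback is raised while seeking in the compressed file, not while parsing tar headers.

```python
import io
import tarfile


def make_tar(name, payload):
    """Build a one-member, uncompressed tar archive in memory (illustrative helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_names(data, ignore_zeros):
    """Return the member names tarfile can see in the given archive bytes."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:",
                      ignore_zeros=ignore_zeros) as archive:
        return archive.getnames()


# Two complete archives back to back: the first one's end-of-archive
# marker (empty blocks) sits in front of the second one's header.
combined = make_tar("a.txt", b"first") + make_tar("b.txt", b"second")

print(list_names(combined, ignore_zeros=False))  # stops at the empty blocks
print(list_names(combined, ignore_zeros=True))   # skips them and keeps going
```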

Secondly:

"For special purposes, there is a second format for mode: 'filemode|[compression]'. tarfile.open() will return a TarFile object that processes its data as a stream of blocks. No random seeking will be done on the file. If given, fileobj may be any object that has a read() or write() method (depending on the mode). bufsize specifies the blocksize and defaults to 20 * 512 bytes. Use this variant in combination with e.g. sys.stdin, a socket file object or a tape device. However, such a TarFile object is limited in that it does not allow to be accessed randomly."

Sounds like accessing the file as a stream may be useful when the file is incomplete.
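A minimal sketch of the stream approach, assuming Python 3 and that each member is small enough to read into memory in one go. The function name iter_readable_members is hypothetical, and the set of exceptions caught is my assumption about how truncation surfaces; it may differ between Python versions.

```python
import tarfile


def iter_readable_members(path):
    """Yield (name, data) for each regular file that can be read in full.

    Hypothetical helper, not part of tarfile. It stops quietly at the
    first sign of truncation instead of raising.
    """
    try:
        # "r|bz2" asks for a forward-only stream, so no seeking is attempted.
        with tarfile.open(path, mode="r|bz2") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                # In stream mode the data has to be read now, before the
                # loop advances to the next member.
                data = archive.extractfile(member).read()
                if len(data) != member.size:
                    return  # this member was cut off by the truncation
                yield member.name, data
    except (EOFError, tarfile.ReadError):
        # Assumption: a truncated stream shows up as one of these.
        return
```

Something like `for name, data in iter_readable_members("output.tar.bz2"):` (the filename is a placeholder) would then process whatever is complete so far. Because the stream cannot go backwards, each member has to be handled in archive order and cannot be revisited later.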

Thanks. I'll try that, although it'll require some rethinking of my code. Apparently `extractfile` and then iterating over the lines produces a backwards seek. — mattdm, Feb 29 '12 at 01:48
