I have a program that saves its output to a tar.bz2 file as it works, and a Python script that processes that data.
I'd like to be able to work with the output if the first program is interrupted, or just run the Python script against it while the first program is still running.
Of course, the final bzip2 block is unfinished, so it can't be read; it's effectively corrupted, although really it's just truncated. GNU tar will actually happily extract all that it can of the file up to that point, as will bzcat, for that matter. And bzip2recover can create repaired blocks, although it's really less useful in this case than bzcat.
But I'm trying to use Python's standard tarfile module. This fails with
  File "/usr/lib64/python2.7/tarfile.py", line 2110, in extractfile
    tarinfo = self.getmember(member)
  File "/usr/lib64/python2.7/tarfile.py", line 1792, in getmember
    tarinfo = self._getmember(name)
  File "/usr/lib64/python2.7/tarfile.py", line 2361, in _getmember
    members = self.getmembers()
  File "/usr/lib64/python2.7/tarfile.py", line 1803, in getmembers
    self._load() # all members, we first have to
  File "/usr/lib64/python2.7/tarfile.py", line 2384, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.7/tarfile.py", line 2319, in next
    self.fileobj.seek(self.offset)
EOFError: compressed file ended before the logical end-of-stream was detected
when I try to use TarFile.extractfile on a file that I know is at the beginning of the archive. (tar -xf tarfile.tar.bz2 filename will extract it just fine.)
Is there anything clever I can do to ignore the invalid end to the file and work with what I've got?
The data set can get rather large, and is very, very compressible, so keeping it uncompressed is not desirable.
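To make the goal concrete, here is a minimal sketch of one possible approach, not a confirmed fix. It assumes that opening the archive in tarfile's stream mode ("r|bz2") and reading members strictly in order avoids the full member scan that getmembers() performs in the traceback above. The helper name iter_readable_members is hypothetical, and the sketch targets Python 3; it has not been checked against the Python 2.7 install shown in the traceback.

```python
import tarfile


def iter_readable_members(path):
    """Yield (name, data) for each regular file that can be read in full.

    Hypothetical helper: reads a possibly truncated tar.bz2 sequentially
    and stops quietly at the first sign that the data has run out.
    """
    try:
        # "r|bz2" opens the archive as a forward-only stream, so members
        # are decoded one at a time instead of being indexed up front.
        with tarfile.open(path, mode="r|bz2") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                # In stream mode each member must be read before moving on.
                data = archive.extractfile(member).read()
                if len(data) < member.size:
                    # Short read: this member was cut off by the truncation.
                    return
                yield member.name, data
    except (EOFError, tarfile.ReadError):
        # Assumption: truncation surfaces as one of these two exceptions.
        return
```

Usage would look like `for name, data in iter_readable_members("tarfile.tar.bz2"): ...`. Anything in the unfinished final bzip2 block is still lost, and each member's contents are held in memory while it is processed, but the archive itself stays compressed on disk.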
(I found the existing question "Untar archive in Python with errors", but in that case, the user is trying to os.system the tar file.)