I have a Python utility that iterates over a tar.xz archive and processes each of its member files. The archive is 15 MB compressed and expands to 740 MB of data.
On one specific server with very limited memory, the program crashes with an out-of-memory error. I used objgraph to see which objects are being created, and it turns out the TarInfo instances are never released. The main loop is similar to this:
    import tarfile
    import objgraph

    count = 0
    with tarfile.open(...) as tar:
        while True:
            member = tar.next()  # returns None once the archive is exhausted
            if member is None:
                break
            stream = tar.extractfile(member)
            process_stream(stream)  # my per-file processing
            count += 1
            if count % 1000 == 0:
                objgraph.show_growth(limit=10)
The output is very consistent:
    TarInfo 2040 +1000
    TarInfo 3040 +1000
    TarInfo 4040 +1000
    TarInfo 5040 +1000
    TarInfo 6040 +1000
    TarInfo 7040 +1000
    TarInfo 8040 +1000
    TarInfo 9040 +1000
    TarInfo 10040 +1000
    TarInfo 11040 +1000
    TarInfo 12040 +1000
This goes on until all 30,000 files in the archive have been processed.
Just to make sure, I commented out the lines that create the stream and process it, so the loop does nothing but call tar.next(). Memory usage behaved exactly the same: the TarInfo instances are still leaked.
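The stripped-down loop looks roughly like this (a sketch; the archive path is elided as in the snippet above, and nothing ever touches the member data):

    import tarfile
    import objgraph

    # Control experiment: no extractfile(), no processing. The TarInfo
    # count still grows by 1000 at every checkpoint.
    count = 0
    with tarfile.open(...) as tar:
        while tar.next() is not None:
            count += 1
            if count % 1000 == 0:
                objgraph.show_growth(limit=10)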
I'm using Python 3.4.1, and this behavior is consistent on Ubuntu, OS X and Windows.
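In case it helps with diagnosis: objgraph can also trace the referrers of one of the leaked instances. This is a sketch using objgraph's own by_type and find_backref_chain helpers; the printed chain should point at whatever is keeping the TarInfo objects alive:

    import random
    import objgraph

    # Pick one leaked TarInfo and walk the reference graph back to a
    # module-level object to see what is still holding on to it.
    leaked = random.choice(objgraph.by_type('TarInfo'))
    chain = objgraph.find_backref_chain(leaked, objgraph.is_proper_module)
    print(' <- '.join(type(obj).__name__ for obj in chain))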