I have a list of very large .tar.gz files (each >2GB) from the Mannheim Webtables Corpus that I want to process using Python. Each archive contains millions of .json files, each of which holds a single JSON object. What I need to do is iterate through all the files one by one, build a JSON object from the content of each file, and then process it.
When I try using tarfile.open, it is painfully slow, as it seems to extract and load the whole archive into memory. Here is my first attempt:
import os
import tarfile

input_files = os.listdir('CORPUS_MANNHEIM')
for file in input_files:
    with tarfile.open('CORPUS_MANNHEIM' + '/' + file) as tfile:
        # walk over every member of the archive
        for jsonfile in tfile.getmembers():
            f = tfile.extractfile(jsonfile)
            content = f.read()
            print(content)  # in the real pipeline this would be parsed with json.loads
The above code is painfully slow and crashes my Jupyter notebook. I have another corpus consisting of plain .gz files, which I can iterate over easily. With .tar.gz files, however, there seems to be no comparable way.
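For comparison, the plain .gz corpus can be streamed with something like the sketch below. This is only an illustration: the directory name OTHER_CORPUS and the assumption that each file can be read line by line are placeholders, not the actual layout of that corpus.

import gzip
import os

gz_dir = 'OTHER_CORPUS'  # placeholder path for the other corpus
for name in os.listdir(gz_dir):
    # gzip.open decompresses on the fly, so nothing is loaded up front
    with gzip.open(os.path.join(gz_dir, name), 'rt') as f:
        for line in f:
            pass  # each record is handled straight from the stream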
I have also tried a few other options, such as first extracting the .tar from the .tar.gz using gunzip or tar -xvf and then processing it, with no luck.
Please let me know if you need any further details. I tried to keep the question as short as possible.
Edit: When I try to read the .tar files using head, it seems it can stream quite fast. The output is a little odd, though: it prints the file name followed by the contents of the file, which is a little inconvenient. You can try it with head --bytes 1000 03.tar.
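For reference, the rough Python equivalent of that head experiment is just reading the first bytes of the archive, and it is similarly fast. (03.tar is the decompressed archive I tested on; the assumption that the compressed version sits at CORPUS_MANNHEIM/03.tar.gz is mine.)

# Rough equivalent of `head --bytes 1000 03.tar`
with open('03.tar', 'rb') as f:
    print(f.read(1000))

# The same works on the compressed archive by decompressing on the fly
import gzip
with gzip.open('CORPUS_MANNHEIM/03.tar.gz', 'rb') as f:  # assumed filename
    print(f.read(1000))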
Edit: My question is different from others of a similar nature in that tarfile does not seem to work at all, even when I try streaming. So I need a much more efficient approach - something like head or less on Linux that can get a stream almost instantly, without the full extraction that tarfile does.
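To be concrete, the sketch below is roughly what I mean by "streaming" with tarfile: opening the archive in pipe mode ('r|gz') so members are read sequentially instead of through a full index. (The filename 03.tar.gz is just an example.) Even this is nowhere near the speed of head for me.

import tarfile

# 'r|gz' opens the archive as a non-seekable stream; members are read
# strictly in order and no member index is built up front
with tarfile.open('CORPUS_MANNHEIM/03.tar.gz', mode='r|gz') as tfile:
    for member in tfile:                  # lazy iteration over members
        f = tfile.extractfile(member)
        if f is None:                     # skip directories and links
            continue
        content = f.read()                # must be read before advancing
        print(content)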