
I have a list of very large .tar.gz files (each >2GB) from the Mannheim Web Tables Corpus that I want to process in Python. Each archive contains millions of .json files, each holding a single JSON object. I need to iterate through the files one by one, create a JSON object from each file's content, and then process it.

When I try using tarfile.open, it is painfully slow, as it appears to load the whole archive into memory.

Here is the first attempt I made:

import os
import tarfile

input_files = os.listdir('CORPUS_MANNHEIM')
for file in input_files:
    with tarfile.open('CORPUS_MANNHEIM' + '/' + file) as tfile:
        # getmembers() scans the whole archive to build the full
        # member list before the loop can even start
        for jsonfile in tfile.getmembers():
            f = tfile.extractfile(jsonfile)
            content = f.read()
            print(content)

The above code is painfully slow and crashes the Jupyter notebook. I have another corpus with a list of .gz files which I can easily iterate over. However, with .tar.gz files, it seems there is just no way.

I have tried a few other options, such as first extracting the .tar file from each .tar.gz using gunzip, or unpacking with tar -xvf and then processing the result, with no luck.

Please let me know if you need any further details. I tried to keep the question as short as possible.

Edit: When I read the .tar file with head, it streams quite fast. The output is a little odd, though: each file name is followed by that file's contents, which is inconvenient. You can try it with head --bytes 1000 03.tar.
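That interleaving is simply the on-disk tar layout: each member is stored as a 512-byte header block, whose first 100 bytes hold the NUL-padded member name, followed by the file's data, so a raw byte stream shows names and contents back to back. A minimal sketch reading the first member's name from an uncompressed .tar (the helper name and path are illustrative):

```python
def first_member_name(tar_path):
    """Return the name of the first member of an uncompressed tar archive."""
    # A tar archive is a sequence of 512-byte blocks; each member
    # begins with a header block whose first 100 bytes contain the
    # NUL-padded member name.
    with open(tar_path, 'rb') as f:
        header = f.read(512)
    return header[:100].rstrip(b'\x00').decode('utf-8')
```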

Edit: My question differs from other, similar ones in that tarfile seems not to work at all here, even when I try streaming. So I need a much more efficient approach: something like head or less on Linux that can get a stream almost instantly, without the full extraction that tarfile does.
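For what it's worth, tarfile does have a streaming mode that avoids scanning the archive index up front: opening with mode='r|gz' decompresses the gzip stream forward-only and yields members one at a time, so getmembers() is never called. A minimal sketch under that assumption (the function name is illustrative):

```python
import json
import tarfile

def stream_json_members(archive_path):
    """Yield one parsed JSON object per regular file in a .tar.gz."""
    # mode='r|gz' opens the archive as a forward-only stream, so
    # members arrive sequentially without building a full index.
    with tarfile.open(archive_path, mode='r|gz') as tfile:
        for member in tfile:
            if not member.isfile():
                continue
            # extractfile() is only valid while the stream is still
            # positioned at this member, so read it immediately.
            f = tfile.extractfile(member)
            yield json.loads(f.read().decode('utf-8'))
```

In stream mode the archive can only be read front to back, which matches the "process every file once" access pattern described above.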

Ahmadov
  • One solution could be to empty the member property of tfile at each iteration. It looks like it reduces the memory load, as explained here: https://blogs.it.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/ – Alessandro Cosentino Feb 21 '18 at 14:31
  • Possible duplicate of [Tarfile in Python: Can I untar more efficiently by extracting only some of the data?](https://stackoverflow.com/questions/26067471/tarfile-in-python-can-i-untar-more-efficiently-by-extracting-only-some-of-the-d) – Andrew Henle Feb 21 '18 at 15:16

0 Answers