I have a list of very large .tar.gz files (each >2GB) from the Mannheim Webtables Corpus that I want to process using Python. Each archive contains millions of .json files, each of which holds a single JSON object. What I need to do is iterate through all the files one by one, build a JSON object from the content of each file, and then process it.
When I try using tarfile.open, it is painfully slow, as it seems to extract and load the whole archive into memory. Here is my first attempt:
import os
import tarfile

input_files = os.listdir('CORPUS_MANNHEIM')
for file in input_files:
    with tarfile.open('CORPUS_MANNHEIM' + '/' + file) as tfile:
        # walk over every member of the archive
        for jsonfile in tfile.getmembers():
            f = tfile.extractfile(jsonfile)
            content = f.read()
            print(content)  # in the real pipeline this would be parsed with json.loads
The above code is painfully slow and crashes my Jupyter notebook. I have another corpus consisting of plain .gz files, which I can iterate over easily. With .tar.gz files, however, there seems to be no comparable way.
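For comparison, the plain .gz corpus can be streamed with something like the sketch below. This is only an illustration: the directory name OTHER_CORPUS and the assumption that each file can be read line by line are placeholders, not the actual layout of that corpus.

import gzip
import os

gz_dir = 'OTHER_CORPUS'  # placeholder path for the other corpus
for name in os.listdir(gz_dir):
    # gzip.open decompresses on the fly, so nothing is loaded up front
    with gzip.open(os.path.join(gz_dir, name), 'rt') as f:
        for line in f:
            pass  # each record is handled straight from the stream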
I have also tried a few other options, such as first extracting the .tar from the .tar.gz using gunzip or tar -xvf and then processing it, with no luck.
Please let me know if you need any further details. I tried to keep the question as short as possible.
Edit: When I try to read the .tar files using head, it seems it can stream quite fast. The output is a little odd, though: it prints the file name followed by the contents of the file, which is a little inconvenient. You can try it with head --bytes 1000 03.tar.
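For reference, the rough Python equivalent of that head experiment is just reading the first bytes of the archive, and it is similarly fast. (03.tar is the decompressed archive I tested on; the assumption that the compressed version sits at CORPUS_MANNHEIM/03.tar.gz is mine.)

# Rough equivalent of `head --bytes 1000 03.tar`
with open('03.tar', 'rb') as f:
    print(f.read(1000))

# The same works on the compressed archive by decompressing on the fly
import gzip
with gzip.open('CORPUS_MANNHEIM/03.tar.gz', 'rb') as f:  # assumed filename
    print(f.read(1000))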
Edit: My question is different from others of a similar nature in that tarfile does not seem to work at all, even when I try streaming. So I need a much more efficient approach - something like head or less on Linux that can get a stream almost instantly, without the full extraction that tarfile does.
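To be concrete, the sketch below is roughly what I mean by "streaming" with tarfile: opening the archive in pipe mode ('r|gz') so members are read sequentially instead of through a full index. (The filename 03.tar.gz is just an example.) Even this is nowhere near the speed of head for me.

import tarfile

# 'r|gz' opens the archive as a non-seekable stream; members are read
# strictly in order and no member index is built up front
with tarfile.open('CORPUS_MANNHEIM/03.tar.gz', mode='r|gz') as tfile:
    for member in tfile:                  # lazy iteration over members
        f = tfile.extractfile(member)
        if f is None:                     # skip directories and links
            continue
        content = f.read()                # must be read before advancing
        print(content)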