Accessing one file at a time from a large (40GB) tar file in Python

Question

I'm trying to access a large tarball (tar.gz) in Python. The tarball contains multiple mp3 or wav files. I'd like to read each file individually and process it.

I did look at a few of the suggestions available here: this and this.

Both solutions show how to read the table of contents, but not how to access or read one file at a time. The other solutions I have seen extract the entire tarball, and I do not have enough disk space left for that.

Any help in this regard will be appreciated.

Possible duplicate: https://stackoverflow.com/questions/20434912/is-it-possible-to-extract-single-file-from-tar-bundle-in-python — ForceBru, Dec 21 '19 at 16:17
So the `tarfile` documentation doesn't document this? Are you really sure? — Stefan Pochmann, Dec 21 '19 at 16:18
I just stumbled on the `next()` function. I can probably use the yield function – tandem, Dec 21 '19 at 16:19
Your file consists of two layers: a tar archive with the files inside and an outer layer of gzip compression. The compression prevents navigating directly to a position in the archive. At least a part of every compression block has to be read. This makes it rather inefficient to extract a single file from a compressed tar archive. – Klaus D., Dec 21 '19 at 16:40
@KlausD. that's what I thought too, but I couldn't verify it. I downloaded a tar.gz file (around 100 MB compressed, 600 MB uncompressed) and monitored the memory consumption while running the code I put in my answer. It never went above 40 MB, even when reading the content of all files. If I open the same file in 7-Zip, it uses 650 MB of memory. Is there anything I missed? – He3lixxx, Dec 21 '19 at 16:45
I'm not concerned about the memory usage. It could be rather small if done properly. My concern is the access time if most of the large file has to be read, especially since these large files are often stored on slow devices. – Klaus D., Dec 21 '19 at 17:33

score 0 · Answer 1 · answered Dec 21 '19 at 16:20

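You can use `TarFile.extractfile` to get a buffered reader on each file in the archive without extracting anything to disk. Members are read in archive order, and the gzip stream is still decompressed sequentially up to each one, so this saves disk space and memory rather than read time. Note that `extractfile` returns None for members that are neither regular files nor links, such as directories.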

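import tarfile

# Iterate over the members one at a time; nothing is written to disk.
with tarfile.open("test.tar.gz") as archive:
    for member in archive:
        if not member.isfile():
            continue  # skip directories and other non-file members
        file_obj = archive.extractfile(member)
        print(member.name, file_obj)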
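
To actually process each member instead of only printing the reader object, you can read from it in chunks so that no whole audio file has to sit in memory. The following is only a minimal sketch: the names `iter_members` and `digest`, the suffix filter, and the SHA-256 hashing are my own placeholders (not part of `tarfile`) standing in for whatever processing you need.

```python
import hashlib
import tarfile


def iter_members(path, suffixes=(".mp3", ".wav")):
    """Yield (name, file object) for each matching regular file, in archive order.

    `suffixes` must be a tuple of lowercase extensions (hypothetical filter).
    Each file object is only valid until the generator is resumed.
    """
    with tarfile.open(path, "r:*") as archive:  # "r:*" auto-detects the compression
        for member in archive:
            if not member.isfile():
                continue  # directories, devices, links, ...
            if not member.name.lower().endswith(suffixes):
                continue
            file_obj = archive.extractfile(member)
            if file_obj is None:
                continue
            with file_obj:
                yield member.name, file_obj


def digest(file_obj, chunk_size=1 << 20):
    """Placeholder processing step: hash the member in 1 MiB chunks."""
    hasher = hashlib.sha256()
    for chunk in iter(lambda: file_obj.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# Example usage (assumes a file named test.tar.gz exists):
# for name, file_obj in iter_members("test.tar.gz"):
#     print(name, digest(file_obj))
```

Because the archive is gzip-compressed, this still reads through the whole file once from start to end; it avoids the disk usage of a full extraction, not the read time.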

answered Dec 21 '19 at 16:20

He3lixxx

I have to agree that this solution is really slow. Are you aware of any other libraries? – tandem Dec 22 '19 at 06:58
