13

I am trying to extract a zipped folder, but instead of calling .extractall() directly, I want to extract each file into a stream so that I can handle the stream myself. Is it possible to do this using tarfile? Or are there any other suggestions?

Martijn Pieters
Robin W.

2 Answers

23

You can obtain each file from a tar file as a Python file object using the .extractfile() method. Loop over the tarfile.TarFile() instance to visit all entries:

import tarfile

with tarfile.open(path) as tf:
    for entry in tf:  # list each entry one by one
        fileobj = tf.extractfile(entry)
        if fileobj is None:
            continue  # directories and other non-file entries have no data
        # fileobj is now an open file object. Use `.read()` to get the data.
        # alternatively, loop over `fileobj` to read it line by line.
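
To make "handling the stream yourself" concrete, here is a minimal sketch that reads each member in fixed-size chunks and hashes it, without writing anything to disk. The helper names `build_demo_tar` and `digest_members`, the sample file names, and the chunk size are illustrative assumptions, not part of the original answer; the archive is built in memory only so the example is self-contained.

```python
import hashlib
import io
import tarfile


def build_demo_tar():
    """Return the bytes of a small gzip-compressed tar archive (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, payload in [("a.txt", b"alpha\n"), ("b.txt", b"beta\ngamma\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def digest_members(archive_bytes, chunk_size=4096):
    """Hash every regular file in the archive without extracting to disk."""
    digests = {}
    # The default mode "r" detects the compression transparently.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tf:
        for entry in tf:
            if not entry.isfile():
                continue  # skip directories, links and other non-file entries
            fileobj = tf.extractfile(entry)
            h = hashlib.sha256()
            # Read in fixed-size chunks; iter() stops at the b"" sentinel (EOF).
            for chunk in iter(lambda: fileobj.read(chunk_size), b""):
                h.update(chunk)
            digests[entry.name] = h.hexdigest()
    return digests
```

Calling `digest_members(build_demo_tar())` returns a dict keyed by member name. The same chunked loop could feed any other consumer (a socket, a parser, an upload) in place of the hash object.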
Martijn Pieters
  • And if the fileobj is a gzip file, would it be possible to decompress it? – Werner Sep 09 '15 at 12:53
  • @Werner: the `tarfile` module takes care of compression for you. See the [`tarfile.open()` documentation](https://docs.python.org/2/library/tarfile.html#tarfile.open), the default mode is `r`, which transparently detects compression and handles decompression as needed. – Martijn Pieters Sep 09 '15 at 12:54
  • Yes, but inside the tarfile I have a gzip file (unfortunately someone created a compressed tarfile with my gzip file…). The `extractfile` returns a `tarfile.ExFileObject` which cannot be used to open a gzip.GzipFile. Would there be a way to open this gzip file without decompressing the tarfile and open the new system file? – Werner Sep 09 '15 at 12:58
  • @Werner: I take it you are using Python 2 then? Python 3's `gzip` module should take that object without issues, but the Python 2 version still tries to seek on the file object. Either upgrade to Python 3, or copy the file to disk first, or decode the stream as you read it, see [Python decompressing gzip chunk-by-chunk](http://stackoverflow.com/q/2423866) – Martijn Pieters Sep 09 '15 at 13:03
  • Yes, still on Python 2, unfortunately, and it's not possible to upgrade as it is part of the environment. Ok, thanks a lot! Couldn't find any information on this… – Werner Sep 09 '15 at 13:10
  • Evidently you need to be careful with directories; you'll get an entry for them, but when you call `extractfile()` on them, `None` will be returned. – weberc2 Dec 02 '15 at 14:49
  • @MartijnPieters A side question (not sure if it's worth putting as a standalone question), what needs to be done for cleanup of a file object obtained from the `extractfile` method? Is the file extracted on disk anywhere and does that need explicit deletion? (python3) – 0xc0de Jun 08 '20 at 09:57
  • @0xc0de: `extractfile` reads directly from the `TarFile` stream, no temp files are created on disk, no cleanup is needed. – Martijn Pieters Jun 09 '20 at 22:42
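
The two points raised in the comments above (a gzip file nested inside the tar, and `None` being returned for directories) can be sketched together as follows. This assumes Python 3, where `gzip.GzipFile` accepts the object returned by `extractfile()`; the helper names `build_nested_archive` and `read_nested_gzip` and the member names are hypothetical, chosen only for the demonstration.

```python
import gzip
import io
import tarfile


def build_nested_archive():
    """Build a tar.gz holding one directory and one gzip-compressed member (demo data)."""
    inner = gzip.compress(b"hello from inside\n")
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        dir_info = tarfile.TarInfo("logs")
        dir_info.type = tarfile.DIRTYPE
        tf.addfile(dir_info)
        file_info = tarfile.TarInfo("logs/app.log.gz")
        file_info.size = len(inner)
        tf.addfile(file_info, io.BytesIO(inner))
    return buf.getvalue()


def read_nested_gzip(archive_bytes):
    """Return {member name: decompressed bytes} for every *.gz member."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes)) as tf:
        for entry in tf:
            fileobj = tf.extractfile(entry)
            if fileobj is None:
                continue  # directories have no data stream, so None comes back
            if entry.name.endswith(".gz"):
                # Second decompression layer, applied directly to the member stream.
                with gzip.GzipFile(fileobj=fileobj, mode="rb") as gz:
                    result[entry.name] = gz.read()
    return result
```

Nothing is written to disk here: the outer archive is decompressed by `tarfile`, and the inner gzip layer is decoded by `gzip.GzipFile` reading from the member's file object.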
1

I was unable to use extractfile() while streaming a tar file over the network, so I did something like this instead:

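import io
import tarfile
import requests
from backports.lzma import LZMAFile

# Wrap the downloaded bytes in a file object before handing them to LZMAFile.
content = requests.get('http://some.com/some.tar.xz').content
some_streamed_tar = LZMAFile(io.BytesIO(content))
with tarfile.open(fileobj=some_streamed_tar) as tf:
    tf.extractall(path="/tmp", members=None)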

And to read them:

import os

for fn in os.listdir("/tmp"):
    with open(os.path.join("/tmp", fn)) as f:
        print(f.read())

Python 2.7.13

jmunsch
  • You can also achieve this directly with streaming, i.e. without any temporary files: https://stackoverflow.com/a/34131505/19163 – vog Jun 01 '18 at 08:58
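
As a rough illustration of the streaming approach mentioned in the comment above, the sketch below opens the archive in `tarfile`'s stream mode (`"r|*"`), which only reads forward and never seeks on the underlying object. It assumes Python 3; `ForwardOnlyReader` is a hypothetical stand-in for a network response body, and `build_demo_tar_xz` and `iter_stream_members` are illustrative names, not taken from the linked answer.

```python
import io
import tarfile


class ForwardOnlyReader:
    """Demo stand-in for a network body: offers read() only, no seek() or tell()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def build_demo_tar_xz():
    """Return the bytes of a small xz-compressed tar archive (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tf:
        for name, payload in [("one.txt", b"first\n"), ("two.txt", b"second\n")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_stream_members(fileobj):
    """Yield (name, bytes) for each regular file, reading the stream strictly forward."""
    # "r|*" selects stream mode with transparent compression detection.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tf:
        for entry in tf:
            if not entry.isfile():
                continue
            # In stream mode a member must be read before moving on to the next one.
            yield entry.name, tf.extractfile(entry).read()
```

With the third-party `requests` library, an object such as `requests.get(url, stream=True).raw` could presumably be passed in place of `ForwardOnlyReader`, since it also exposes `read()`; that pairing is an assumption here and is not exercised by the example.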