How to use Python3.6 tarfile module to read from memory?

Question

I would like to download a tar file from a URL into memory and then extract all its contents to the folder dst. What should I do?

Below is my attempt, which does not do what I want.

#!/usr/bin/python3.6
# -*- coding: utf-8 -*-

from pathlib import Path
from io import BytesIO
from urllib.request import Request, urlopen
from urllib.error import URLError
from tarfile import TarFile


def get_url_response( url ):
    req = Request( url )
    try:
        response = urlopen( req )
    except URLError as e:
        if hasattr( e, 'reason' ):
            print( 'We failed to reach a server.' )
            print( 'Reason: ', e.reason )
        elif hasattr( e, 'code'):
            print( 'The server couldn\'t fulfill the request.' )
            print( 'Error code: ', e.code )
    else:
        # everything is fine
        return response

url = 'https://dl.opendesktop.org/api/files/download/id/1566630595/s/6cf6f74c4016e9b83f062dbb89092a0dfee862472300cebd0125c7a99463b78f4b912b3aaeb23adde33ea796ca9232decdde45bb65a8605bfd8abd05eaee37af/t/1567158438/c/6cf6f74c4016e9b83f062dbb89092a0dfee862472300cebd0125c7a99463b78f4b912b3aaeb23adde33ea796ca9232decdde45bb65a8605bfd8abd05eaee37af/lt/download/Blue-Maia.tar.xz'
dst = Path().cwd() / 'Tar'

response = get_url_response( url )

with TarFile( BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

However, I got this error:

Traceback (most recent call last):
  File "~/test_tar.py", line 31, in <module>
    with TarFile( BytesIO( response.read() ) ) as tfile:
  File "/usr/lib/python3.6/tarfile.py", line 1434, in __init__
    fileobj = bltn_open(name, self._mode)
TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO

I tried passing the BytesIO object to TarFile as a fileobj:

with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

However, it still doesn't work:

Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 188, in nti
    s = nts(s, "ascii", "strict")
  File "/usr/lib/python3.6/tarfile.py", line 172, in nts
    return s.decode(encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd2 in position 0: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/tarfile.py", line 2297, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.6/tarfile.py", line 1093, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.6/tarfile.py", line 1035, in frombuf
    chksum = nti(buf[148:156])
  File "/usr/lib/python3.6/tarfile.py", line 191, in nti
    raise InvalidHeaderError("invalid header")
tarfile.InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/test_tar.py", line 31, in <module>
    with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
  File "/usr/lib/python3.6/tarfile.py", line 1482, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.6/tarfile.py", line 2309, in next
    raise ReadError(str(e))
tarfile.ReadError: invalid header

score 4 · Accepted Answer · answered Aug 30 '19 at 09:58

This approach was very close to correct:

with TarFile( fileobj=BytesIO( response.read() ) ) as tfile:
    tfile.extractall( path=dst )

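You should use tarfile.open instead of calling the TarFile constructor directly (see docs): the constructor reads only uncompressed archives, whereas tarfile.open handles compression. Tell it that you are reading an xz file (mode='r:xz'), and note that this needs import tarfile rather than from tarfile import TarFile: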

with tarfile.open( fileobj=BytesIO( response.read() ), mode='r:xz' ) as tfile:
    tfile.extractall( path=dst )

However, as you'll notice, this is still not enough.

The root problem? You're downloading from a site which disallows hotlinking. The website is blocking your attempt to download. Try printing out the response and you'll see you get a load of junk HTML instead of a tar.xz file.
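One way to catch this early is to check the first bytes of the download before handing it to tarfile. The following is a minimal sketch, not part of the original answer: the helper names `looks_like_xz` and `extract_xz_tar_from_bytes` are made up for illustration, and the only outside fact it relies on is that an .xz stream begins with the six bytes FD 37 7A 58 5A 00.

```python
from io import BytesIO
import tarfile

# Every .xz stream starts with these six bytes.
XZ_MAGIC = b"\xfd7zXZ\x00"


def looks_like_xz(data):
    """Return True if the payload starts with the xz magic bytes."""
    return data[:6] == XZ_MAGIC


def extract_xz_tar_from_bytes(data, dst):
    """Extract an in-memory .tar.xz payload to dst (hypothetical helper)."""
    if not looks_like_xz(data):
        # An HTML error page, for example, ends up here instead of
        # failing later with "not an lzma file".
        raise ValueError("payload is not xz data; it starts with %r" % data[:60])
    with tarfile.open(fileobj=BytesIO(data), mode="r:xz") as tfile:
        tfile.extractall(path=dst)
```

With the code from the question, this would be called as `extract_xz_tar_from_bytes( response.read(), dst )`. If the server returned an HTML page, the error message shows the start of that page rather than a tarfile traceback.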

I used another `.tar.xz` type url that allowed downloading. Yes, using the `tarfile.open()` function worked. Thank you also for the reference, I overlooked it. Any chance/way to circumvent the hotlinking? — Sun Bear, Aug 30 '19 at 14:10

olinox14 · Answer 2 · 2019-08-30T14:47:40.050

2

Strangely, I managed to make it work using the tarfile.open() function, but not by instantiating a TarFile object. It seems the opening mode cannot be set correctly with the latter...

Anyway, this works:

from io import BytesIO
import tarfile

with open('Blue-Maia.tar.xz', 'rb') as f:
    tar = tarfile.open(fileobj=BytesIO( f.read() ), mode="r:xz")
    tar.extractall( path="test" )
    tar.close()

You could add a try...finally, or use tarfile.open() in a with statement, to ensure the tar file is always closed.
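As a rough sketch of the with-statement variant: the helper name `extract_from_readable` is hypothetical and not from this thread. It accepts anything with a `read()` method, so a file opened in 'rb' mode and the `response` object from the question should both fit.

```python
from io import BytesIO
import tarfile


def extract_from_readable(readable, dst, mode="r:xz"):
    """Buffer readable in memory, extract it to dst, return member names."""
    # Read the whole payload into memory first; tarfile then works on
    # the in-memory buffer instead of a file on disk.
    buffer = BytesIO(readable.read())
    # The with statement closes the TarFile even if extractall() raises.
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        tar.extractall(path=dst)
        return tar.getnames()
```

For example, `extract_from_readable( response, "test" )` would stand in for the three lines that open, extract and close by hand.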

Update:

In your code:

response = get_url_response( url )
tar = tarfile.open(fileobj=BytesIO( response.read() ), mode="r:xz")
tar.extractall( path="test" )
tar.close()

edited Aug 30 '19 at 14:47

answered Aug 30 '19 at 09:32

olinox14


Does your approach write to memory? I did not see `BytesIO` used. Can you please explain? – Sun Bear Aug 30 '19 at 10:28
Btw, your `with open()` statement returned `FileNotFoundError: [Errno 2] No such file or directory: 'Blue-Maia.tar.xz'` – Sun Bear Aug 30 '19 at 10:46
Oh sorry, I tried a few things and I made a mistake while posting the solution... Fixed – olinox14 Aug 30 '19 at 12:16
And the `with open` is just here to replace the fileobject you get with your `get_url_response` method, the lines you need are the last 3 – olinox14 Aug 30 '19 at 12:18
Can you show how your code links with `response` from my code? I can't see it with your script. The statement `with open('Blue-Maia.tar.xz', 'rb') as f` means that you are opening a file called "Blue-Maia.tar.xz" which already exists in your current working directory and you are assigning this opened file to `f`. – Sun Bear Aug 30 '19 at 14:37
I did it because a file-like object (resulting from the `with open`) has the very same `read()` method as your `response` object; it was just to shorten the code and get to the essentials... However, I updated my answer. – olinox14 Aug 30 '19 at 14:47
With your updated code, I got `tarfile.ReadError: not an lzma file`. – Sun Bear Aug 30 '19 at 14:47
Ah, strange. But the same error happens with the code from the accepted answer, so why did you accept it if it is not working? – olinox14 Aug 30 '19 at 14:54

Thanks for helping. Your updated answer finally looks similar to the answer by @Score_Under. I accepted that answer because it first explained my mistake and showed the correct syntax that I should have used, which answered my question. ;) – Sun Bear Aug 30 '19 at 15:11
