I have a .gz file and I need to get the names of the files inside it using Python.

This question is the same as this one

The only difference is that my file is .gz, not .tar.gz, so the tarfile library did not help me here.

I am using the requests library to request a URL. The response is a compressed file.

Here is the code I am using to download the file:

import shutil
import requests

# line, base_output_dir, current_dir and count come from the surrounding script
response = requests.get(line.rstrip(), stream=True)
if response.status_code == 200:
    with open(str(base_output_dir)+"/"+str(current_dir)+"/"+str(count)+".gz", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

This code downloads the file under a name such as 1.gz. If I open the file with an archive manager, it contains something like my_latest_data.json.

I need to extract the file so that the output is named my_latest_data.json.

Here is the code I am using to extract the file:

import gzip

# f is the path of the downloaded file, e.g. "1.gz"
outfilename = f.split(".")[0]
with gzip.open(f, 'rb') as inF, open(outfilename, 'wb') as outF:
    outF.write(inF.read())

The outfilename variable is a string I build in the script, but I need the real file name (my_latest_data.json).

Fanooos
  • The problem is that gzip is *just* compression, not necessarily an archive. There may not be a manifest inside to even look at. – zxq9 Nov 08 '15 at 08:36
  • What's the error? Where's the code you've tried? Your question is unclear. – l'L'l Nov 08 '15 at 08:41
  • Adding to what @zxq9 said, gzip is different from a Zip file (archive) in that it can only "contain" one file. The only thing it may have is the original filename. – Jonathon Reinhart Nov 08 '15 at 08:46
  • @I'L'L kindly check the last edit – Fanooos Nov 08 '15 at 08:49
  • @JonathonReinhart I decided to flesh that out a bit better in an answer -- I imagine the OP isn't the only one who has wondered why this is the case. That said, the OP may want to make this question a bit more general in nature so that others can find it. – zxq9 Nov 08 '15 at 08:51

4 Answers

You can't, because Gzip is not an archive format.

That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...

It's just compression

Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether or not the bytes inside represent multiple files or just a single file -- it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. It's bytes_in -> bytes_out.
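
A minimal sketch of that bytes_in -> bytes_out behaviour, using the standard gzip module. The payload here is made-up sample data standing in for "two files" glued together:

```python
import gzip

# Any bytes will do; gzip neither knows nor cares whether they came from one file or many.
payload = b"first file contents\n" + b"second file contents\n"

compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)

# The round trip returns exactly the same bytes, with no record of any file boundaries.
print(restored == payload)
```

Nothing in the compressed stream says where one "file" ended and the next began; that information would have to come from a format layered inside the payload.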

What's a manifest?

A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.

What's inside a Gzip, anyway?

A somewhat simplified description of a .gz file's contents is:

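  1. A fixed header (10 bytes) with a magic number identifying it as gzip, the compression method, a flags byte, a timestamp, and two small fields for extra flags and the originating operating system
  2. Optional headers; usually including the original filename (if the compression target was a file)
  3. The body -- some compressed payload
  4. A trailer (8 bytes) with a CRC-32 checksum and the size of the uncompressed data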

That's it. No manifest.
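
To make the layout concrete, here is a rough sketch that picks apart a gzip stream produced in memory. The field order follows the gzip specification (RFC 1952); the variable names are only illustrative:

```python
import gzip
import struct
import zlib

payload = b"hello gzip"
blob = gzip.compress(payload)

# Fixed 10-byte header: magic (2), compression method (1), flags (1),
# modification time (4), extra flags (1), operating system (1).
magic, method, flags, mtime, xfl, os_code = struct.unpack("<2sBBIBB", blob[:10])
print(magic == b"\x1f\x8b", method)

# 8-byte trailer: CRC-32 of the uncompressed data, then its length modulo 2**32.
crc, size = struct.unpack("<II", blob[-8:])
print(crc == zlib.crc32(payload), size == len(payload))
```

Everything between the header (plus any optional fields) and the trailer is the compressed payload.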

Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of files together into a single file, with a header in front of each one recording its original name and size, so the headers together let you know what was concatenated into the archive. Hence, .tar.gz being so common.
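
As an illustration of the difference, this sketch builds a small .tar.gz in memory and lists its members with the standard tarfile module. The member names and contents are hypothetical:

```python
import io
import tarfile

# Build a small .tar.gz in memory with two members (hypothetical names).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, data in [("a.txt", b"alpha"), ("b.json", b"{}")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Reading it back: gzip handles decompression, tar supplies names and sizes.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    members = [(m.name, m.size) for m in tar.getmembers()]
print(members)
```

The names come from the tar layer, not from gzip, which is why tarfile cannot help with a plain .gz file.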

There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory to then let you examine a manifest or whatever that may be inside. But the details of any manifest are specific to the archive format contained inside.
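
One way such a peek could work, sketched in Python: decompress only the first block and look for the tar magic string. The helper name looks_like_tar and the sample data are assumptions for illustration, and other archive formats would need their own checks:

```python
import gzip
import io
import tarfile

def looks_like_tar(gz_bytes):
    """Decompress only the first 512 bytes and check for the tar magic string."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as gz:
        head = gz.read(512)
    # In a tar header block the string "ustar" sits at offset 257.
    return head[257:262] == b"ustar"

# Hypothetical sample data: one gzipped tar archive and one gzipped plain file.
raw_tar = io.BytesIO()
with tarfile.open(fileobj=raw_tar, mode="w") as tar:
    info = tarfile.TarInfo(name="a.txt")
    info.size = 5
    tar.addfile(info, io.BytesIO(b"alpha"))
tar_gz = gzip.compress(raw_tar.getvalue())
plain_gz = gzip.compress(b"just some text")

print(looks_like_tar(tar_gz), looks_like_tar(plain_gz))
```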

Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression format, like bzip2 and friends.
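
For comparison, a short sketch with the standard zipfile module, where listing the contained names is a direct call. The member names are hypothetical:

```python
import io
import zipfile

# Write a zip archive with two members into memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "alpha")
    zf.writestr("b.json", "{}")

# The archive carries its own table of contents, so the names can be listed directly.
buf.seek(0)
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
print(names)
```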

zxq9
  • You may want to explain what a "manifest" is. It isn't clear that it describes the list of files contained in the archive. – Jonathon Reinhart Nov 08 '15 at 08:53
  • how can the normal archive managers be able to show me the files inside the .gz ? – Fanooos Nov 08 '15 at 08:56
  • @Fanooos They will peek inside a gzip to see whether it contains an archive or not. As mentioned above, there are utilities to let you manipulate gzipped data on the fly (like `zcat` -- do `man zcat` to read about it -- very slick) so it can check internal file headers this way without much overhead. That is also why so many .ps documents are ".ps.gz" these days -- the time/space tradeoff is generally favorable today. – zxq9 Nov 08 '15 at 09:02
  • @JonathonReinhart Done. Thanks. – zxq9 Nov 08 '15 at 09:03

As noted in the other answer, your question can only make sense if I take out the plural: "I have a .gz file and I need to get the name of file inside it using python."

A gzip header may or may not have a file name in it. The gzip utility will normally ignore the name in the header, and decompress to a file with the same name as the .gz file, but with the .gz stripped. E.g. your 1.gz would decompress to a file named 1, even if the header has the file name my_latest_data.json in it. The -N option of gzip will use the file name in the header (as well as the time stamp in the header), if there is one. So gzip -dN 1.gz would create the file my_latest_data.json, instead of 1.
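
A small sketch of how the name ends up in the header in the first place, assuming the file is written with Python's gzip.GzipFile and an explicit filename argument. The name and payload are sample values:

```python
import gzip
import io

buf = io.BytesIO()
# The filename argument is stored in the header's optional name field.
with gzip.GzipFile(filename="my_latest_data.json", mode="wb", fileobj=buf) as gz:
    gz.write(b'{"ok": true}')
blob = buf.getvalue()

has_name = bool(blob[3] & 8)                          # name flag bit in the flags byte
name_in_header = b"my_latest_data.json" in blob[:40]  # stored uncompressed near the start
print(has_name, name_in_header)
print(gzip.decompress(blob))  # payload only; the name is not part of it
```

Decompressing gives back only the payload, so the stored name has to be read from the header separately.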

You can find the file name in the header in Python by processing the header manually. You can find the details in the gzip specification.

  1. Verify that the first three bytes are 1f 8b 08.
  2. Save the fourth byte. Call it flags. If flags & 8 is zero, then give up -- there is no file name in the header.
  3. Skip the next six bytes.
  4. If flags & 4 is not zero, then read the next two bytes. Considering them to be in little endian order, make an integer out of those two bytes, calling it xlen. Then skip xlen bytes.
  5. We already know that flags & 8 is not zero, so you are now at the file name. Read bytes until you get to a zero byte. Those bytes up to, but not including, the zero byte are the file name. (The optional header CRC signalled by flags & 2 comes after the file name and comment, so there is nothing to skip for it here.)
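
The procedure can be sketched roughly as follows. It follows the field order in the gzip specification (RFC 1952), where the optional header CRC comes after the file name and so never has to be skipped before reading it. The function name gzip_header_name is made up, and decoding the name as Latin-1 is the encoding the specification prescribes:

```python
import gzip
import io

def gzip_header_name(stream):
    """Return the file name stored in a gzip header, or None if there is none."""
    head = stream.read(10)
    if len(head) < 10 or head[:3] != b"\x1f\x8b\x08":
        raise ValueError("not a gzip stream using deflate")
    flags = head[3]
    if not flags & 8:            # name bit clear: no name stored
        return None
    if flags & 4:                # extra field: 2-byte little-endian length, then data
        xlen = int.from_bytes(stream.read(2), "little")
        stream.read(xlen)
    name = bytearray()
    while True:
        byte = stream.read(1)
        if not byte or byte == b"\x00":
            break
        name += byte
    return name.decode("latin-1")

# Sample input: a gzip stream written with a stored name (hypothetical data).
buf = io.BytesIO()
with gzip.GzipFile(filename="my_latest_data.json", mode="wb", fileobj=buf) as gz:
    gz.write(b"{}")
print(gzip_header_name(io.BytesIO(buf.getvalue())))
```

Passing an open binary file instead of the BytesIO object should work the same way.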
Mark Adler

Note: This answer is obsolete as of Python 3; it relies on Python 2 internals of the gzip module such as gzip.read32, which Python 3 no longer provides.

Using the tips from Mark Adler's reply and a bit of inspection of the gzip module, I've set up this function that extracts the internal filename from gzip files. I noticed that GzipFile objects have a private method called _read_gzip_header() that almost gets the filename, so I based my function on it:

import gzip

def get_gzip_filename(filepath):
    f = gzip.open(filepath)
    f._read_gzip_header()
    f.fileobj.seek(0)
    f.fileobj.read(3)
    flag = ord(f.fileobj.read(1))
    mtime = gzip.read32(f.fileobj)
    f.fileobj.read(2)
    if flag & gzip.FEXTRA:
        # Read & discard the extra field, if present
        xlen = ord(f.fileobj.read(1))
        xlen = xlen + 256*ord(f.fileobj.read(1))
        f.fileobj.read(xlen)
    filename = ''
    if flag & gzip.FNAME:
        while True:
            s = f.fileobj.read(1)
            if not s or s=='\000':
                break
            else:
                filename += s
    return filename or None
AndreLobato

The Python 3 gzip library discards this information, but you could adapt the code from around the link to do something else with it.

As noted in other answers on this page, this information is optional anyway. But it can be retrieved if you need to check whether it's there.

import struct


def gzinfo(filename):
    # Copy+paste from gzip.py line 16
    FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT = 1, 2, 4, 8, 16
    
    with open(filename, 'rb') as fp:
        # Basically copy+paste from GzipFile module line 429f
        magic = fp.read(2)
        if magic == b'':
            return False

        if magic != b'\037\213':
            raise ValueError('Not a gzipped file (%r)' % magic)

        method, flag, _last_mtime = struct.unpack("<BBIxx", fp.read(8))

        if method != 8:
            raise ValueError('Unknown compression method')

        if flag & FEXTRA:
            # Read & discard the extra field, if present
            extra_len, = struct.unpack("<H", fp.read(2))
            fp.read(extra_len)
        if flag & FNAME:
            fname = []
            while True:
                s = fp.read(1)
                if not s or s==b'\000':
                    break
                fname.append(s.decode('latin-1'))
            return ''.join(fname)
        
def main():
    from sys import argv
    for filename in argv[1:]:
        print(filename, gzinfo(filename))

if __name__ == '__main__':
    main()

This replaces the exceptions in the original code with a vague ValueError exception; you might want to fix that if you intend to use this more broadly and turn it into a proper module you can import. It also uses the generic read() function instead of the specific _read_exact() method, which goes through some trouble to ensure that it got exactly the number of bytes it requested; this too could be lifted over if you wanted to.
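
If you do want that stricter behaviour, a helper along these lines might serve. This is a sketch of the idea only, not the library's own code, and the name read_exact is an assumption:

```python
import io

def read_exact(fp, n):
    """Read exactly n bytes from fp, or raise EOFError if the stream ends early."""
    data = fp.read(n)
    while len(data) < n:
        chunk = fp.read(n - len(data))
        if not chunk:
            raise EOFError("stream ended before the requested bytes were read")
        data += chunk
    return data

# A stream with enough data returns exactly the requested number of bytes.
print(read_exact(io.BytesIO(b"abcdef"), 4))
```

Swapping fp.read(...) for read_exact(fp, ...) in the header parsing would turn a truncated file into an explicit EOFError instead of a confusing unpacking failure.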

tripleee
  • On a related question, @RandomDavis [points out](https://stackoverflow.com/questions/66875085/gzip-read32-method-not-available-in-python-3#comment118213420_66875085) this https://bugs.python.org/issue1159051 – tripleee Mar 30 '21 at 17:11
  • Good job @triplee! I was right to stir things up then! ;) This afternoon (Italian time zone) I'll try it! – Memmo Mar 31 '21 at 05:49
  • This is ultimately quite similar to the Python 2 answer, I notice, though the use of `struct` is probably an improvement. This code appears to work on Python 2 as well. – tripleee Mar 31 '21 at 05:52