1

So basically I have a file system like this:

main_archive.tar.gz
  main_archive.tar
    sub_archive.xml.gz
      actual_file.xml

There are hundreds of files in this archive... So basically, can the gzip package be used with multiple files in Python 3? I've only used it with a single file zipped so I'm at a loss on how to go over multiple files or multiple levels of "zipping".

My usual method of decompressing is:

with gzip.open(file_path, "rb") as f:
  for ln in f.readlines():
    *decode encoding here*

Of course, this has multiple problems because usually "f" is just a file... But now I'm not sure what it represents?

Any help/advice would be much appreciated!

EDIT 1:

I've accepted the answer below, but if you're looking for similar code, my backbone was basically:

tar = tarfile.open(file_path, mode="r")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if verbose:
        print("Decoding", member.name, "...")
    with gzip.open(f, "rb") as temp:
        decoded = temp.read().decode("UTF-8")
        e = xml.etree.ElementTree.parse(decoded).getroot()
        for child in e:
            print(child.tag)
            print(child.attrib)
            print("\n\n")

tar.close()

Main packages used were gzip, tarfile, and xml.etree.ElementTree.

user3684314
  • 707
  • 11
  • 31

2 Answers2

3

gzip only supports compressing a single file or stream. In your case, the exrtacted stream is a tar object, so you'd use Python's tarfile library to manipulate the extracted contents. This library actually knows how to cope with .tar.gz so you don't need to explicitly extract the gzip yourself.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • 1
    https://stackoverflow.com/questions/15857792/how-to-construct-a-tarfile-object-in-memory-from-byte-buffer-in-python-3 has some useful code examples. – tripleee Jun 03 '18 at 17:48
0

Use Python's tarfile to get the contained files, and then Python's gzip again inside the loop to extract the xml.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158