1

I have a large zip file that I would like to unzip without loading all of its bytes into memory. Ideally the unzipping happens concurrently with fetching the zipped bytes via an HTTP request.

How can this be done from Python?

Note: I am specifically asking about the zip format, not gzip. Questions such as Python unzipping stream of bytes?, although they often use the word "zip", appear to be about gzip.

Michal Charemza
  • 25,940
  • 14
  • 98
  • 165
  • Does this answer your question? [Python unzipping stream of bytes?](https://stackoverflow.com/questions/12571913/python-unzipping-stream-of-bytes) – Anton Curmanschii May 16 '21 at 08:32
  • @AntonCurmanschii I don't think so: although that question's title says "zip", I think the contents are more about gzip? – Michal Charemza May 16 '21 at 08:34

3 Answers

1

It is possible to do this from within Python, without calling an external process, and it can handle all the files in the zip, not just the first.

This can be done by using stream-unzip [disclaimer: written by me].

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, file_chunks in stream_unzip(zipped_chunks()):
    for chunk in file_chunks:
        print(chunk)
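
To illustrate why a ZIP can be unzipped as it arrives, here is a minimal standard-library sketch of the general idea: each member is preceded by a local file header, and a deflate stream signals its own end. This is not how stream-unzip is implemented; the name `unzip_stream_sketch` is hypothetical, and the sketch assumes no ZIP64, no encryption, and only stored or deflated members.

```python
import struct
import zlib

# Fixed 30-byte part of a ZIP local file header (little-endian).
LOCAL_HEADER = struct.Struct('<4sHHHHHIIIHH')
LOCAL_SIG = b'PK\x03\x04'
DESCRIPTOR_SIG = b'PK\x07\x08'


def unzip_stream_sketch(chunks):
    """Yield (member_name, decompressed_chunk) pairs from an iterable of bytes."""
    source = iter(chunks)
    buf = b''  # bytes fetched from the source but not yet parsed

    def read_exactly(n):
        # Return n bytes (fewer only if the input ends early).
        nonlocal buf
        while len(buf) < n:
            try:
                buf += next(source)
            except StopIteration:
                break
        data, buf = buf[:n], buf[n:]
        return data

    def read_some():
        # Return whatever is buffered, else the next non-empty chunk (b'' at end).
        nonlocal buf
        while not buf:
            try:
                buf = next(source)
            except StopIteration:
                break
        data, buf = buf, b''
        return data

    while True:
        header = read_exactly(LOCAL_HEADER.size)
        if len(header) < LOCAL_HEADER.size or header[:4] != LOCAL_SIG:
            return  # central directory or end of input: no more members
        (_, _, flags, method, _, _, _, comp_size, _,
         name_len, extra_len) = LOCAL_HEADER.unpack(header)
        name = read_exactly(name_len)
        read_exactly(extra_len)  # the extra field is ignored in this sketch
        has_descriptor = bool(flags & 0x08)

        if method == 8:  # deflate: the stream itself marks where it ends
            inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no zlib header
            while not inflater.eof:
                data = read_some()
                if not data:
                    raise ValueError('truncated ZIP stream')
                out = inflater.decompress(data)
                if out:
                    yield name, out
            buf = inflater.unused_data + buf  # bytes belonging to what follows
        elif method == 0 and not has_descriptor:  # stored, size known up front
            remaining = comp_size
            while remaining:
                data = read_some()
                if not data:
                    raise ValueError('truncated ZIP stream')
                piece, buf = data[:remaining], data[remaining:] + buf
                remaining -= len(piece)
                yield name, piece
        else:
            raise NotImplementedError('unsupported member in this sketch')

        if has_descriptor:
            # Data descriptor: optional signature, then CRC-32 and two 4-byte sizes.
            first = read_exactly(4)
            read_exactly(12 if first == DESCRIPTOR_SIG else 8)
```

With the `zipped_chunks()` generator above, this could be consumed as `for name, chunk in unzip_stream_sketch(zipped_chunks()): ...`. Note that it does not verify CRC-32 checksums and does not cap how much output a single input chunk can expand to, so a maintained library is likely the safer choice for real use.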
Michal Charemza
0

By calling funzip from within Python, which can be done using iterable-subprocess [disclaimer: written by me], you can unzip the first file in a ZIP archive:

from iterable_subprocess import iterable_subprocess
import httpx

def zipped_chunks():
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for chunk in iterable_subprocess(['funzip'], zipped_chunks()):
    print(chunk)
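
For comparison, piping chunks through an external process can be sketched with only the standard library. This is an illustration of the general technique, not how iterable-subprocess is implemented; the name `pipe_through` is hypothetical, and the `BrokenPipeError` handling assumes a POSIX-like system.

```python
import contextlib
import subprocess
import threading


def pipe_through(command, chunks, read_size=65536):
    """Feed chunks to the stdin of command and yield its stdout piece by piece."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed():
        # Runs in a separate thread so that writing stdin and reading stdout
        # overlap; doing both in one thread can deadlock once a pipe fills up.
        try:
            for chunk in chunks:
                proc.stdin.write(chunk)
        except BrokenPipeError:
            pass  # the process exited before consuming all of its input
        finally:
            with contextlib.suppress(BrokenPipeError):
                proc.stdin.close()  # signals end of input to the process

    feeder = threading.Thread(target=feed, daemon=True)
    feeder.start()
    try:
        while True:
            out = proc.stdout.read(read_size)
            if not out:
                break
            yield out
    finally:
        proc.stdout.close()
        feeder.join()
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, command)
```

Assuming funzip is installed, this could be used as `for chunk in pipe_through(['funzip'], zipped_chunks()): ...`. An exception raised by the `chunks` iterable itself is not propagated to the caller in this sketch.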
Michal Charemza
0

In my case I needed to decompress .zst (Zstandard) files, so I had to use a more manual approach.

I'll post my code here for anyone with a similar case who came here because of the title "Stream unZIP archive".


import codecs
import httpx, zstandard
from io import BytesIO

io = BytesIO(b'')
dctx = zstandard.ZstdDecompressor(max_window_size=2147483648)  # allow windows up to 2 GiB
writer = dctx.stream_writer(io)
# incremental decoder, so a multi-byte character split across two chunks still decodes
decoder = codecs.getincrementaldecoder('utf-8')()

i = 0
with httpx.stream('GET', 'https://files.pushshift.io/reddit/comments/RC_2006-07.zst') as r:
    for chunk in r.iter_bytes():
        writer.write(chunk)  # writer decompresses the chunk and writes the result to io
        io.seek(i)  # set offset to read added chunk
        data = io.read()  # get newly decompressed bytes
        i += len(data)  # advance by the decompressed length, not the compressed chunk length
        # io.seek(0); io.truncate(); i = 0  # clear io

        # then do what you want with the data
        txt = decoder.decode(data)
        with open('test.txt', 'a') as f:
            # edit txt -> txt = filter(txt)
            f.write(txt)

A Stack Overflow answer mentions that io.seek(0); io.truncate() can be slow. The commented-out line in the code is my attempt to avoid keeping everything in memory, if needed.
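
One thing to watch when decompressed bytes are turned into text chunk by chunk: a multi-byte UTF-8 character can straddle a chunk boundary, and calling `bytes.decode()` on each chunk separately would then raise `UnicodeDecodeError`. A minimal sketch of decoding incrementally with the standard library follows; the helper name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(byte_chunks, encoding='utf-8'):
    """Yield text for each bytes chunk, carrying incomplete characters forward."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in byte_chunks:
        text = decoder.decode(chunk)  # keeps any trailing partial character
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the input ended mid-character
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

In the loop above, the decompressed `data` of each iteration would play the role of one element of `byte_chunks`.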

Daniel Olson
  • 73
  • 1
  • 3