In my case I needed to decompress .zst files, so I had to use a more manual approach.
I'll post my code here for anyone with a similar case who came here because of the title "Stream unZIP archive":

    import codecs
    import httpx
    import zstandard
    from io import BytesIO

    io = BytesIO(b'')
    dctx = zstandard.ZstdDecompressor(max_window_size=2147483648)  # allow window sizes up to 2 GiB
    writer = dctx.stream_writer(io)
    # incremental decoder: a multi-byte character may be split across two chunks
    decoder = codecs.getincrementaldecoder('utf-8')()
    i = 0
    url = 'https://files.pushshift.io/reddit/comments/RC_2006-07.zst'
    # open the output file once instead of once per chunk
    with httpx.stream('GET', url) as r, open('test.txt', 'a', encoding='utf-8') as f:
        for chunk in r.iter_bytes():
            writer.write(chunk)  # writer decompresses the chunk and writes the new data to io
            io.seek(i)           # set offset to read the newly added data
            data = io.read()     # get the newly decompressed data
            i += len(data)       # advance by the decompressed length, not the compressed chunk length
            # io.seek(0); io.truncate(); i = 0  # clear io
            # then do what you want with the data
            txt = decoder.decode(data)
            # edit txt here, e.g. txt = filter(txt)
            f.write(txt)

A Stack Overflow answer mentions that the io.seek(0); io.truncate() line is slow. It is my attempt to avoid keeping everything in memory, if needed.
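
If memory is the main concern, another option might be to skip the intermediate buffer entirely and use a streaming decompression object, which hands back only the newly decompressed bytes for each chunk. The sketch below is an assumption-laden illustration, not a tested replacement for the code above: `iter_text` is a hypothetical helper name, and `zlib` from the standard library is used as a stand-in so the example runs without a download. As far as I know, `zstandard.ZstdDecompressor(...).decompressobj()` exposes the same `decompress(chunk)` method.

```python
import codecs
import zlib  # stand-in for zstandard so the sketch runs offline


def iter_text(chunks, dobj, encoding="utf-8"):
    """Yield decoded text for each compressed chunk without keeping earlier output.

    `dobj` is any object with a `decompress(bytes) -> bytes` method
    (hypothetical helper, not part of httpx or zstandard).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        data = dobj.decompress(chunk)  # only the newly decompressed bytes
        text = decoder.decode(data)    # holds back incomplete multi-byte characters
        if text:
            yield text
    # Flush the decoder; raises if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Demo with made-up data: compress some text, then feed it back in 7-byte pieces.
original = "héllo wörld\n" * 1000
compressed = zlib.compress(original.encode("utf-8"))
pieces = [compressed[i:i + 7] for i in range(0, len(compressed), 7)]
restored = "".join(iter_text(pieces, zlib.decompressobj()))
print(restored == original)
```

With the download above, the call would presumably look like `iter_text(r.iter_bytes(), zstandard.ZstdDecompressor(max_window_size=2147483648).decompressobj())`, writing each yielded piece to the file as it arrives. Nothing accumulates between chunks, so there is no seek or truncate step to pay for.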