5

Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (single file per archive) in S3. I would like to process the files by iterating over them, without having to download them to disk or read them fully into memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)

    count = 0
    for chunk in key:
        # How should decompress happen?
        count += decompress(chunk).count('\n')

    return count

This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.
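For comparison, here is roughly what that zlib-based gzip technique looks like when adapted to the example above (a sketch; wbits=47 tells zlib to accept a gzip or zlib header, so it can inflate chunk by chunk):

import zlib

import boto

def count_newlines_gz(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .gz file
    key = b.get_key(key_name)

    # zlib inflates incrementally, so no random access is needed
    decompressor = zlib.decompressobj(47)  # 32 + 15: auto-detect gzip/zlib header
    count = 0
    for chunk in key:
        count += decompressor.decompress(chunk).count(b'\n')

    return count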

Rahul Gupta-Iwasaki
  • Have you tried adapting that code to use [`zipfile`](https://docs.python.org/2/library/zipfile.html) instead of `zlib`? – MattDMo Mar 31 '15 at 18:01
  • Yep! ZipFile expects random access to the file it's unzipping, so I don't think it'll really work with the s3 iterator.. – Rahul Gupta-Iwasaki Mar 31 '15 at 18:02
  • See also https://stackoverflow.com/questions/10405210/create-and-stream-a-large-archive-without-storing-it-in-memory-or-on-disk – DNA Mar 31 '15 at 18:32
  • Both tar and gzip were designed to work with data streams. Zip, however, was not. So the best answer to this question would be to simply not use that format. – rspeed Aug 06 '17 at 16:48

5 Answers

3

While I suspect it's not possible with absolutely all zip files, I suspect almost(?) all modern zip files are streaming-compatible, and streaming decompression is possible, for example using https://github.com/uktrade/stream-unzip [full disclosure: originally written by me].

The example from its README shows how to do this for an arbitrary HTTP request using httpx:

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes(chunk_size=65536)

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

but I think it could be adapted for boto3 to stream unzip/decompress from S3 (untested):

from stream_unzip import stream_unzip
import boto3

def zipped_chunks():
    yield from boto3.client('s3', region_name='us-east-1').get_object(
        Bucket='my-bucket-name',
        Key='the/key/of/the.zip'
    )['Body'].iter_chunks()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
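And adapting that to the newline-counting example from the question would be something like this (again untested; it only uses the stream_unzip and boto3 calls shown above):

from stream_unzip import stream_unzip
import boto3

def count_newlines(bucket_name, key_name):
    def zipped_chunks():
        yield from boto3.client('s3').get_object(
            Bucket=bucket_name,
            Key=key_name
        )['Body'].iter_chunks()

    count = 0
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
        # the chunks for each file are consumed in order as they arrive
        for chunk in unzipped_chunks:
            count += chunk.count(b'\n')
    return count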
Michal Charemza
2

Yes, but you'll likely have to write your own code to do it if it has to be in Python. You can look at sunzip for an example in C for how to unzip a zip file from a stream. sunzip creates temporary files as it decompresses the zip entries, and then moves those files and sets their attributes appropriately upon reading the central directory at the end. Claims that you must be able to seek to the central directory in order to properly unzip a zip file are incorrect.

Mark Adler
1

The zip central directory (the index of the archive) is at the end of the file, which is why zipfile needs random access. See https://en.wikipedia.org/wiki/Zip_(file_format)#Structure.

You could parse the local file header which should be at the start of the file for a simple zip, and decompress the bytes with zlib (see zipfile.py). This is not a valid way to read a zip file, and while it might work for your specific scenario, it could also fail on a lot of valid zips. Reading the central directory file header is the only right way to read a zip.
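A rough illustration of that idea (a sketch only, with stream_first_entry as a hypothetical helper; it assumes a single deflate-compressed entry whose sizes are recorded in the local header, i.e. no data descriptor, no encryption, no ZIP64):

import struct
import zlib

def stream_first_entry(fileobj, chunk_size=65536):
    # local file header: 4-byte signature followed by 26 bytes of fixed fields
    header = fileobj.read(30)
    (signature, version, flags, method, mtime, mdate, crc,
     compressed_size, uncompressed_size, name_len, extra_len) = struct.unpack(
        '<4sHHHHHIIIHH', header)
    if signature != b'PK\x03\x04':
        raise ValueError('not a local file header')
    if flags & 0x08:
        raise ValueError('data descriptor in use, so sizes are not known up front')
    if method != 8:
        raise ValueError('only deflate is handled in this sketch')
    fileobj.read(name_len + extra_len)  # skip the file name and extra field

    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate stream
    remaining = compressed_size
    while remaining:
        chunk = fileobj.read(min(chunk_size, remaining))
        if not chunk:
            break
        remaining -= len(chunk)
        yield decompressor.decompress(chunk)
    yield decompressor.flush()

For the question's example, sum(chunk.count(b'\n') for chunk in stream_first_entry(key)) would then give the newline count, since the boto key object is file-like.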

rezca
1

You can use https://pypi.python.org/pypi/tubing; it even has built-in S3 source support using boto3.

from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print len(output)

If you don't want to store the entire output in the returned sink, you could make your own sink that just counts. The implementation would look like this:

class CountWriter(object):
    def __init__(self):
        self.count = 0
    def write(self, chunk):
        self.count += len(chunk)
Counter = sinks.MakeSink(CountWriter)
doki_pen
-3

You can do it in Python 3.4.3 using ZipFile as follows:

from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())

Python Docs