Is there a way to do streaming decompression of single-file zip archives?
I currently have arbitrarily large zip archives (one file per archive) in S3, and I'd like to process each one by iterating over its contents without first downloading the whole archive to disk or into memory.
A simple example:
import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)
    count = 0
    for chunk in key:
        # How should decompression happen here?
        count += decompress(chunk).count('\n')
    return count
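For gzip'd input, streaming decompression is straightforward, since zlib can consume a gzip stream incrementally. A minimal sketch of that idea (gunzip_chunks is just an illustrative name, not a library function):

import zlib

def gunzip_chunks(chunks):
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header/trailer
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        yield d.decompress(chunk)
    yield d.flush()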
This answer demonstrates that kind of streaming technique for gzip'd files. Unfortunately, I haven't been able to get the same approach working with the zipfile module, as it seems to require random access to the entire archive being unzipped.
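The one workaround I can think of, sketched here in case it helps frame the question: a single-member zip archive is normally just a local file header followed by one raw deflate stream, so in principle zlib alone could stream it, bypassing zipfile entirely. A sketch under those assumptions (the member must be deflate-compressed, i.e. method 8, and unencrypted; iter_zip_member and the manual header parsing are mine, not zipfile APIs):

import struct
import zlib

def iter_zip_member(chunks):
    """Yield decompressed pieces of a zip's only member, given an
    iterable of raw byte chunks from the archive."""
    chunks = iter(chunks)
    buf = b''
    # The local file header is 30 fixed bytes plus a variable-length
    # file name and extra field.
    while len(buf) < 30:
        buf += next(chunks)
    if buf[:4] != b'PK\x03\x04':
        raise ValueError('not a zip local file header')
    method, = struct.unpack('<H', buf[8:10])
    if method != 8:
        raise ValueError('member is not deflate-compressed')
    name_len, extra_len = struct.unpack('<HH', buf[26:30])
    header_len = 30 + name_len + extra_len
    while len(buf) < header_len:
        buf += next(chunks)
    # Negative wbits means a raw deflate stream with no zlib/gzip wrapper.
    d = zlib.decompressobj(-zlib.MAX_WBITS)
    yield d.decompress(buf[header_len:])
    for chunk in chunks:
        yield d.decompress(chunk)
        if d.unused_data:
            # End of the deflate stream; the rest is the data
            # descriptor / central directory, which we don't need.
            break
    yield d.flush()

With something like that, the example above would reduce to:

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    key = conn.get_bucket(bucket_name).get_key(key_name)
    return sum(piece.count(b'\n') for piece in iter_zip_member(key))

The header parsing is the fragile part, though: the offsets follow the zip local file header layout, and anything unusual (a non-deflate member, encryption) would need extra handling, so I'm hoping there's a more robust approach.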