
I want to allow users to download an archive of multiple large files at once. However, the files and the archive may be too large to store in memory or on disk on my server (they are streamed in from other servers on the fly). I'd like to generate the archive as I stream it to the user.

I can use tar or zip or whatever is simplest. I am using Django, which allows me to return a generator or file-like object in my response. This object could be used to pump the process along. However, I am having trouble figuring out how to build this sort of thing around the zipfile or tarfile libraries, and I'm afraid they may not support reading files as they go, or reading the archive as it is built.

This answer on converting an iterator to a file-like object might help. `tarfile.TarFile.addfile` takes a file-like object, but it appears to immediately pass it to shutil.copyfileobj, so this may not be as generator-friendly as I had hoped.

Nick Retallack
  • In general, compression utilities like zip or tar need to read the entire input file in order to determine what can and should be compressed. So I think your architecture idea is flawed. – jedwards May 01 '12 at 22:25
  • @jedwards: quite wrong; `tar` is just a container, no compression. It was designed to work with _tapes_ -- where reading the entire thing first was out of the question. And `zlib` will _happily_ compress a stream of data. You can get _better_ compression with full-file awareness, but that is by no means mandatory. – sarnold May 01 '12 at 22:26
  • @sarnold, I assumed he meant a compressed tarball since he was talking about compression. And zlib still needs to cache a number of input bytes before generating output, because the same requirement of analyzing the input data remains. So I agree I misspoke with "entire", but I still hold that this makes little sense, as the amount of data you will save by compressing small stream segments will be negligible compared to the amount of work spent writing. – jedwards May 01 '12 at 22:30
  • @jedwards, the more nuanced response is significantly better, but consider if the data is coming from a radio antenna or a microphone -- there might be ample opportunity to store streaming data in a format more convenient for transfer and unpacking without storing all the data first for analysis. – sarnold May 01 '12 at 22:35
  • @jedwards I never mentioned compression. Compression can be handled by Apache mod_gzip. Also, I don't know what you mean about 'saving data'. The goal here is to reduce the memory usage on the server, allowing it to simply stream data from one place to another without ever holding onto too much data at once. – Nick Retallack May 01 '12 at 23:01
  • The tar format doesn't seem so rough; you may need to write your own tools if the standard API doesn't provide what you need. – sarnold May 01 '12 at 23:05
  • @NickRetallack, I didn't say "saving data", so I'm not sure what you're referencing. But if, despite talking about zip and tarballs, you don't want to handle compression, then all my points are moot. – jedwards May 01 '12 at 23:05
  • Are you able to get the file size of the files you're streaming in without downloading the entire file? If so, you should be able to package the files in a tar according to [the specification](http://www.gnu.org/software/tar/manual/html_node/Standard.html) (see the sketch below). – jedwards May 01 '12 at 23:09
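
Following up on that last comment: when each member's size is known in advance, the standard tarfile module can already write an archive sequentially using its stream mode ('w|'), which never seeks the output. Below is a minimal sketch of that idea; it is not from the question or the answers, and IteratorFile, tar_stream, and the (name, size, chunks) member shape are illustrative names of my own.

import io
import tarfile

class IteratorFile(io.RawIOBase):
    """Expose read() over an iterator of byte chunks, for tarfile.addfile()."""
    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buf = b''

    def readable(self):
        return True

    def read(self, n=-1):
        # accumulate chunks until the request can be satisfied (or input ends)
        while n < 0 or len(self._buf) < n:
            try:
                self._buf += next(self._chunks)
            except StopIteration:
                break
        if n < 0:
            data, self._buf = self._buf, b''
        else:
            data, self._buf = self._buf[:n], self._buf[n:]
        return data

def tar_stream(members, out_fileobj):
    # 'w|' writes an uncompressed tar as a forward-only stream (no seeking),
    # so out_fileobj can be an HTTP response body
    with tarfile.open(fileobj=out_fileobj, mode='w|') as tar:
        for name, size, chunks in members:
            info = tarfile.TarInfo(name=name)
            info.size = size  # tar headers require each file's size up front
            tar.addfile(info, IteratorFile(chunks))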

5 Answers


I ended up using SpiderOak ZipStream.
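
For reference, a hedged sketch of how that can look in a Django view, assuming the python-zipstream package (a descendant of SpiderOak's ZipStream) and a hypothetical fetch_chunks() generator that yields the remote file's bytes:

import zipstream  # assumption: the python-zipstream package
from django.http import StreamingHttpResponse

def download(request):
    z = zipstream.ZipFile(mode='w', compression=zipstream.ZIP_DEFLATED)
    z.write('/path/to/local/file')              # a file already on disk
    z.write_iter('remote.bin', fetch_chunks())  # chunks from a generator

    # iterating the ZipFile yields the archive bytes as they are produced
    response = StreamingHttpResponse(z, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename="archive.zip"'
    return response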

Nick Retallack

You can do it by generating and streaming a zip file with no compression, which basically amounts to writing each file's local header before its content. You're right, the libraries don't support this, but you can hack around them to get it working.

This code wraps zipfile.ZipFile with a class that manages the stream and creates instances of zipfile.ZipInfo for the files as they come. CRC and size can be set at the end. You can push data from the input stream into it with put_file(), write() and flush(), and read data out of it to the output stream with read().

import struct
import time
import zipfile

# NB: this is Python 2 code (StringIO module, 0600 octal literal, StringIO.pos)
from StringIO import StringIO

class ZipStreamer(object):
    def __init__(self):
        self.out_stream = StringIO()

        # write to the stringIO with no compression
        self.zipfile = zipfile.ZipFile(self.out_stream, 'w', zipfile.ZIP_STORED)

        self.current_file = None

        self._last_streamed = 0

    def put_file(self, name, date_time=None):
        if date_time is None:
            date_time = time.localtime(time.time())[:6]

        zinfo = zipfile.ZipInfo(name, date_time)
        zinfo.compress_type = zipfile.ZIP_STORED
        zinfo.flag_bits = 0x08            # bit 3: CRC and sizes follow in a data descriptor
        zinfo.external_attr = 0600 << 16  # -rw------- file permissions
        zinfo.header_offset = self.out_stream.pos

        # placeholders; the real values are written later, in flush()
        zinfo.CRC = 0
        zinfo.file_size = 0
        zinfo.compress_size = 0

        self.zipfile._writecheck(zinfo)

        # write header to stream
        self.out_stream.write(zinfo.FileHeader())

        self.current_file = zinfo

    def flush(self):
        # finish the current entry: write the data descriptor with the
        # now-known CRC and sizes (flag bit 3 announced it would follow)
        zinfo = self.current_file
        self.out_stream.write(struct.pack("<LLL", zinfo.CRC, zinfo.compress_size, zinfo.file_size))
        self.zipfile.filelist.append(zinfo)
        self.zipfile.NameToInfo[zinfo.filename] = zinfo
        self.current_file = None

    def write(self, bytes):
        self.out_stream.write(bytes)
        self.out_stream.flush()
        zinfo = self.current_file
        # update these...
        zinfo.CRC = zipfile.crc32(bytes, zinfo.CRC) & 0xffffffff
        zinfo.file_size += len(bytes)
        zinfo.compress_size += len(bytes)

    def read(self):
        i = self.out_stream.pos

        self.out_stream.seek(self._last_streamed)
        bytes = self.out_stream.read()

        self.out_stream.seek(i)
        self._last_streamed = i

        return bytes

    def close(self):
        self.zipfile.close()

Keep in mind that this code was just a quick proof of concept, and I did no further development or testing once I decided to let the HTTP server itself deal with the problem. A few things you should look into if you decide to use it: check whether nested folders are archived correctly, and watch out for filename encoding (which is always a pain with zip files anyway).
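
A minimal usage sketch (my addition, not part of the original answer), where upstream_files is a hypothetical iterable of (name, chunk-iterator) pairs; the resulting generator can be returned from a Django view as a streaming response body:

def zip_response_generator(upstream_files):
    z = ZipStreamer()
    for name, chunks in upstream_files:
        z.put_file(name)
        for chunk in chunks:
            z.write(chunk)
            yield z.read()   # hand over whatever bytes are ready so far
        z.flush()            # write the entry's data descriptor
        yield z.read()
    z.close()                # write the zip central directory
    yield z.read()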

Pedro Werneck
    One thing I'm worried about is, when you use StringIO, will that end up collecting all the data in memory? Does the stuff that's already been read out of the StringIO ever get freed? – Nick Retallack May 02 '12 at 04:05
  • You're probably right, but StringIO is not essential to this implementation, just the easiest to use. You can make a file-like object that always deals with the last chunk only. – Pedro Werneck May 02 '12 at 10:12

You can stream a ZipFile to a Pylons or Django response fileobj by wrapping the fileobj in something file-like that implements tell(). This will buffer each individual file in the zip in memory, but stream the zip itself. We use it to stream download a zip file full of images, so we never buffer more than a single image in memory.

This example streams to sys.stdout. For Pylons use response.body_file, for Django you can use the HttpResponse itself as a file.

import zipfile
import sys


class StreamFile(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.pos = 0

    def write(self, data):  # renamed from 'str', which shadowed the builtin
        self.fileobj.write(data)
        self.pos += len(data)

    def tell(self):
        # ZipFile uses tell() to record each entry's header offset
        return self.pos

    def flush(self):
        self.fileobj.flush()


# Wrap a stream so ZipFile can use it (on Python 3, wrap sys.stdout.buffer)
out = StreamFile(sys.stdout)
z = zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED)

for i in range(5):
    z.writestr("hello{0}.txt".format(i), "this is hello{0} contents\n".format(i) * 3)

z.close()
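
And a hedged sketch of the Django variant mentioned above, using an HttpResponse as the underlying file (image_paths is a hypothetical list; note that on modern Django this buffers the whole response rather than streaming it):

from django.http import HttpResponse

def zip_images(request):
    response = HttpResponse(content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename="images.zip"'
    z = zipfile.ZipFile(StreamFile(response), 'w', zipfile.ZIP_DEFLATED)
    for path in image_paths:  # hypothetical iterable of image files
        z.write(path)
    z.close()
    return response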
rectalogic

Here is the solution from Pedro Werneck (above), but with a fix to avoid collecting all the data in memory (the read method is changed a little):

import struct
import time
import zipfile

import StringIO  # Python 2, as in the original answer

class ZipStreamer(object):
    def __init__(self):
        self.out_stream = StringIO.StringIO()

        # write to the stringIO with no compression
        self.zipfile = zipfile.ZipFile(self.out_stream, 'w', zipfile.ZIP_STORED)

        self.current_file = None

        self._last_streamed = 0

    def put_file(self, name, date_time=None):
        if date_time is None:
            date_time = time.localtime(time.time())[:6]

        zinfo = zipfile.ZipInfo(name, date_time)
        zinfo.compress_type = zipfile.ZIP_STORED
        zinfo.flag_bits = 0x08
        zinfo.external_attr = 0600 << 16
        zinfo.header_offset = self.out_stream.pos

        # placeholders; the real values are written later, in flush()
        zinfo.CRC = 0
        zinfo.file_size = 0
        zinfo.compress_size = 0

        self.zipfile._writecheck(zinfo)

        # write header to stream
        self.out_stream.write(zinfo.FileHeader())

        self.current_file = zinfo

    def flush(self):
        zinfo = self.current_file
        self.out_stream.write(
            struct.pack("<LLL", zinfo.CRC, zinfo.compress_size,
                        zinfo.file_size))
        self.zipfile.filelist.append(zinfo)
        self.zipfile.NameToInfo[zinfo.filename] = zinfo
        self.current_file = None

    def write(self, bytes):
        self.out_stream.write(bytes)
        self.out_stream.flush()
        zinfo = self.current_file
        # update these...
        zinfo.CRC = zipfile.crc32(bytes, zinfo.CRC) & 0xffffffff
        zinfo.file_size += len(bytes)
        zinfo.compress_size += len(bytes)

    def read(self):
        self.out_stream.seek(self._last_streamed)
        bytes = self.out_stream.read()
        self._last_streamed = 0

        # free the memory held by data that has already been streamed out
        self.out_stream.seek(0)
        self.out_stream.truncate()
        self.out_stream.flush()

        # caveat: truncating resets out_stream.pos, so the header offsets
        # recorded in the central directory stop being absolute; streaming
        # unzippers cope, but strict random-access readers may reject the result

        return bytes

    def close(self):
        self.zipfile.close()

You can then use this stream_generator function as the stream for the zip file:

def stream_generator(files_paths):
    s = ZipStreamer()
    for f in files_paths:
        s.put_file(f)
        with open(f, 'rb') as _f:  # binary mode keeps CRCs and sizes correct
            for chunk in iter(lambda: _f.read(64 * 1024), b''):
                s.write(chunk)  # stream in chunks rather than reading the
                yield s.read()  # whole file into memory at once
        s.flush()
        yield s.read()
    s.close()
    yield s.read()  # emit the central directory written by close()

An example for Falcon:

class StreamZipEndpoint(object):
    def on_get(self, req, resp):
        files_paths = [
            '/path/to/file/1',
            '/path/to/file/2',
        ]
        zip_filename = 'output_filename.zip'
        resp.content_type = 'application/zip'
        resp.set_headers([
            ('Content-Disposition', 'attachment; filename="%s"' % (
                zip_filename,))
        ])

        resp.stream = stream_generator(files_paths)
dm2013
  • Scribbling data to a zipfile in successive chunks should be *easy*. ZipInfo objects should just support "write" operations if they are at the end (latest added) of the archive. – Erik Aronesty Jun 12 '18 at 14:48

An option is to use stream-zip (full disclosure: written by me).

Amending its example slightly:

from datetime import datetime
from stream_zip import stream_zip, ZIP_64

def non_zipped_files():
    modified_at = datetime.now()
    perms = 0o600

    # Hard coded in this example, but in real cases could
    # for example yield data from a remote source
    def file_1_data():
        for i in range(0, 1000):
            yield b'Some bytes'

    def file_2_data():
        for i in range(0, 1000):
            yield b'Some bytes'

    yield 'my-file-1.txt', modified_at, perms, ZIP_64, file_1_data()
    yield 'my-file-2.txt', modified_at, perms, ZIP_64, file_2_data()

zipped_chunks = stream_zip(non_zipped_files())

# Can print each chunk, or return them to a client,
# say using Django's StreamingHttpResponse
for zipped_chunk in zipped_chunks:
    print(zipped_chunk)
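
And a short sketch of the Django route the comment above alludes to (view and filename are illustrative):

from django.http import StreamingHttpResponse

def download_zip(request):
    response = StreamingHttpResponse(
        stream_zip(non_zipped_files()),
        content_type='application/zip',
    )
    response['Content-Disposition'] = 'attachment; filename="files.zip"'
    return response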
Michal Charemza