
I have an app which manages a set of files, but those files are actually stored in Rackspace's CloudFiles, because most of them will be ~100GB. I'm using the CloudFiles TempURL feature to let users download individual files, but sometimes the user will want to download a set of files. Downloading all of those files to the server and generating a local Zip file there is impossible, since the server only has 40GB of disk space.

From the user's point of view, I want to implement it the way Gmail does when you get an email with several pictures: it gives you a link to download a Zip file with all the images in it, and the download starts immediately.

How can I accomplish this with Python/Django? I have found ZipStream, and it looks promising because of its iterator output, but it still only accepts filepaths as arguments, and the writestr method would need to fetch all the file data at once (~100GB).

Armando Pérez Marqués

3 Answers


Since Python 3.5 it is possible to create a zip stream of huge files/folders in chunks, by writing to an unseekable stream, so there is no need to use ZipStream now. See my answer here.

And live example here: https://repl.it/@IvanErgunov/zipfilegenerator

If you don't have a filepath but have chunks of bytes, you can drop open(path, 'rb') as entry from the example and replace iter(lambda: entry.read(16384), b'') with your own iterable of bytes. Then prepare the ZipInfo manually:

import time
import zipfile
from zipfile import ZipInfo

# Describe an archive member that does not exist as a file on disk.
zinfo = ZipInfo(filename='any-name-of-your-non-existent-file',
                date_time=time.localtime(time.time())[:6])
zinfo.compress_type = zipfile.ZIP_STORED
# permissions:
if zinfo.filename[-1] == '/':
    # directory
    zinfo.external_attr = 0o40775 << 16   # drwxrwxr-x
    zinfo.external_attr |= 0x10           # MS-DOS directory flag
else:
    # file
    zinfo.external_attr = 0o600 << 16     # ?rw-------

You should also remember that the zipfile module writes chunks of its own size. So if you feed it a 512-byte piece, the output stream will receive data only when, and only in the size, the zipfile module decides to emit it. That depends on the compression algorithm, but I don't think it is a problem, because the zipfile module writes small chunks of <= 16384 bytes.
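
For completeness, here is a minimal sketch of that streaming approach, assuming Python 3.6+ (ZipFile.open(..., mode='w') was added in 3.6) and assuming your CloudFiles client can hand you each remote file as an iterable of byte chunks. The UnseekableStream class, the zip_stream generator and the (name, chunks) shape of its input are hypothetical names for illustration, not part of the zipfile API:

import time
import zipfile
from zipfile import ZipFile, ZipInfo

class UnseekableStream:
    # Write-only, unseekable file-like object: it simply buffers whatever
    # the zipfile module writes so the generator below can yield it.
    def __init__(self):
        self._buffer = b''
    def writable(self):
        return True
    def write(self, data):
        self._buffer += data
        return len(data)
    def get(self):
        chunk, self._buffer = self._buffer, b''
        return chunk

def zip_stream(files):
    # files: iterable of (name, chunks) pairs, where chunks is an iterable
    # of bytes -- e.g. requests' iter_content() over a CloudFiles TempURL.
    stream = UnseekableStream()
    with ZipFile(stream, mode='w') as zf:
        for name, chunks in files:
            zinfo = ZipInfo(filename=name,
                            date_time=time.localtime(time.time())[:6])
            zinfo.compress_type = zipfile.ZIP_STORED
            zinfo.external_attr = 0o600 << 16
            # force_zip64: the entry size is unknown up front and may exceed 2 GiB
            with zf.open(zinfo, mode='w', force_zip64=True) as entry:
                for chunk in chunks:
                    entry.write(chunk)
                    yield stream.get()
    yield stream.get()  # the central directory is written when the ZipFile closes

In Django such a generator can be passed directly to a StreamingHttpResponse, so the archive never has to exist on disk.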

don_vanchos

You can use https://pypi.python.org/pypi/tubing. Here's an example using S3; you could pretty easily create a Rackspace CloudFiles Source. Create a custom Writer (instead of sinks.Objects) to stream the data somewhere else, and custom Transformers to transform the stream.

from tubing.ext import s3
from tubing import pipes, sinks

# bucket and key are placeholders for your S3 bucket name and object key
output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print(len(output))
doki_pen

Check this out - it's part of the Python Standard Library: http://docs.python.org/3/library/zipfile.html#zipfile-objects

You can give it an open file or a file-like object.
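
For instance, a minimal sketch using an in-memory io.BytesIO as the file-like object (unrelated to CloudFiles, just to show the call):

import io
import zipfile

# ZipFile accepts any file-like object, not just a path on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('hello.txt', b'hello world')

print(len(buf.getvalue()))  # size of the finished in-memory archive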

dstromberg
  • Thanks @dstromberg, but it still doesn't solve the problem of writing a stream or a data iterator to the Zip file. The [`write`](http://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.write) method still requires a filepath as the first argument. – Armando Pérez Marqués Dec 27 '13 at 03:15
  • [This answer](http://stackoverflow.com/a/10235749/288457) is close to what I need, but it reads the whole image data and passes it to the [`writestr`](http://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.writestr) method, which is impossible when the files are 100+GB. – Armando Pérez Marqués Dec 27 '13 at 03:19