
I've got a Python method which needs to collect lots of data from an API, format it into a CSV, compress it and stream the result back.

I've been Googling and every solution I can find either requires writing to a temp file or keeping the whole archive in memory.

Memory is definitely not an option as I'd get OOM pretty rapidly. Writing to a temp file has a slew of problems associated with it (this box only uses disk for logs at the moment, much longer lead time before download starts, file cleanup issues, etc, etc). Not to mention the fact that it's just nasty.

I'm looking for a library that will allow me to do something like...

C = Compressor(outputstream)
C.BeginFile('Data.csv')
for D in Api.StreamResults():
    C.Write(D)
C.CloseFile()
C.Close()

In other words, something that will be writing the output stream as I write data in.

I've managed to do this in .NET and PHP - but I have no idea how to approach it in Python.

To put things into perspective, by "lots" of data, I mean I need to be able to handle up to ~10 GB of (raw plaintext) data. This is part of an export/dump process for a big data system.

Basic
  • What's wrong with [zipfile.ZipFile](http://docs.python.org/2/library/zipfile#zipfile.ZipFile)? The `write` method writes compressed bytes as you get them. If you pass another file-like object (a stream, for example) to the Zipfile constructor, it'll write to it. You can also use [gzip](http://docs.python.org/2/library/gzip.html) – loopbackbee Oct 01 '13 at 14:27
  • 1
    @goncalopp The `write()` method takes a filename. I'm generating the data on-the-fly. `writestr()` would require me to collate all the data in memory – Basic Oct 01 '13 at 14:29
  • 1
    you're right, I was convinced zipfile was similar to gzip. [`gzip.Gzip.write`](http://docs.python.org/2/library/gzip.html#examples-of-usage) can do what you're asking. Do you really need a zip stream? – loopbackbee Oct 01 '13 at 14:33
  • To amplify @goncalopp: a gzip object will accept multiple `write()` calls, as the OP requires – msw Oct 01 '13 at 14:56
  • Have a look at [create and stream a large archive without storing it in memory or on disk](http://stackoverflow.com/questions/10405210/create-and-stream-a-large-archive-without-storing-it-in-memory-or-on-disk) they mention [SpiderOak ZipStream](https://github.com/gourneau/SpiderOak-zipstream) – Rod Oct 01 '13 at 15:06
  • @goncalopp Good question - and the answer is "perhaps not". I'll see if gzip will do the trick. Edit: Correct me if I'm wrong, but gzip doesn't seem to support writing to a stream, just a file? – Basic Oct 01 '13 at 15:58
  • @Rod Thanks for the link, but I've stumbled across that before - I couldn't find any documentation, and the only example seems to involve zipping a whole directory - which is what the asker of that question was after – Basic Oct 01 '13 at 16:01
  • The trick to using gzip is to make its stdout point to a file. It's a great solution (gzip, which is CPU intensive, will run in parallel). But you may find your program does some blocking as it writes to gzip's stdin. – tdelaney Oct 01 '13 at 16:08
  • @tdelaney Are you referring to the shell command for gzip? If so, that's problematic as my code currently runs on both Windows and Linux machines. Presumably I'd also need 2 threads - one to pipe data to stdin and another to read from stdout and pass it to my response stream? Apologies if I've misunderstood. – Basic Oct 01 '13 at 16:12
  • @Basic "a file", in Python, is used [in the unix sense](http://en.wikipedia.org/wiki/Everything_is_a_file), so a stream is a file, too. Since you have duck-typing, it's easy to write your own custom stream, if needed. Of course, `sys.stdout` is a file, too... – loopbackbee Oct 01 '13 at 16:13
  • @goncalopp Thanks, I mis-read this: `When fileobj is _not_ None, the filename argument is only used...`. That looks like what I need. Care to post as an answer so I can accept? – Basic Oct 01 '13 at 16:14
  • 1
    @Basic, yes I was thinking the executable. I like it because it parallelizes a cpu intensive task. You can do it in one thread by piping to the file `subprocess.Popen(['gzip'],stdin=subprocess.PIPE, stdout=open('someoutputfile', 'wb'))`. – tdelaney Oct 01 '13 at 16:17

2 Answers


As the gzip module documentation states, you can pass a file-like object to the GzipFile constructor. Since Python is duck-typed, you're free to implement your own stream, like so:

import sys
from gzip import GzipFile

class MyStream(object):
    """Minimal file-like object: anything with a write() method will do."""
    def write(self, data):
        # write to your real output stream here...
        sys.stdout.write(data)  # stdout, for example

gz = GzipFile(fileobj=MyStream(), mode='w')
gz.write("something")
gz.close()  # flushes buffered data and writes the gzip trailer
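
For the streaming-response use case in the question, the same idea works with any object exposing `write()`. A minimal sketch, where `send_chunk` is a hypothetical callback standing in for whatever your web framework uses to push bytes to the client:

from gzip import GzipFile

class CallbackStream(object):
    """File-like adapter forwarding each compressed chunk to a callback."""
    def __init__(self, send_chunk):
        self.send_chunk = send_chunk  # hypothetical: your framework's chunk writer

    def write(self, data):
        if data:  # the compressor may buffer and emit empty chunks
            self.send_chunk(data)

def stream_csv(rows, send_chunk):
    gz = GzipFile(fileobj=CallbackStream(send_chunk), mode='w')
    try:
        for row in rows:   # e.g. Api.StreamResults()
            gz.write(row)  # compressed bytes flow out as you write
    finally:
        gz.close()         # writes the gzip trailer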
loopbackbee

@goncalopp's answer is great, but you can achieve more parallelism if you run gzip externally, since compression then happens in a separate process. As you're collecting lots of data, it may be worth the extra effort. You'll need to find your own compression routine for Windows (there are several gzip implementations, but something like 7-Zip may also work). You could also experiment with things like lzma, which compresses more than gzip, depending on what else you need to optimize in your system.

import subprocess as subp
import os

class GZipWriter(object):
    """Context manager that pipes writes through an external gzip process."""

    def __init__(self, filename):
        self.filename = filename
        self.fp = None
        self.proc = None

    def __enter__(self):
        self.fp = open(self.filename, 'wb')
        # gzip runs as a separate process, compressing in parallel
        self.proc = subp.Popen(['gzip'], stdin=subp.PIPE, stdout=self.fp)
        return self

    def __exit__(self, type, value, traceback):
        self.close()
        if type:
            # an exception occurred: don't leave a partial file behind
            os.remove(self.filename)

    def close(self):
        if self.fp:
            self.proc.stdin.close()  # send EOF so gzip can flush its output
            self.proc.wait()         # let gzip finish before closing the file
            self.fp.close()
            self.fp = None

    def write(self, data):
        self.proc.stdin.write(data)

with GZipWriter('sometempfile') as gz:
    for i in range(10):
        gz.write('a'*80+'\n')
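
If the compressed bytes need to go to a response stream instead of a file (as in the question), one option discussed in the comments above is to capture gzip's stdout and drain it from a second thread. A rough sketch, where `send_chunk` is again a hypothetical stand-in for your framework's chunk writer:

import subprocess as subp
import threading

def gzip_to_stream(rows, send_chunk, chunk_size=64 * 1024):
    """Pipe rows through an external gzip, forwarding output to send_chunk."""
    proc = subp.Popen(['gzip'], stdin=subp.PIPE, stdout=subp.PIPE)

    def drain():
        # forward compressed chunks as gzip produces them
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            send_chunk(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    try:
        for row in rows:
            proc.stdin.write(row)
    finally:
        proc.stdin.close()  # EOF lets gzip flush its trailer
        reader.join()       # wait until all output has been forwarded
        proc.wait()

The reader thread avoids the deadlock you'd otherwise hit when the stdout pipe buffer fills up while this process is still blocked writing to gzip's stdin.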
tdelaney
  • 1
    Who, aside from you, said anything about parallelizing? This is an answer in search of a problem. – msw Oct 02 '13 at 00:12
  • 11
    @msw - who said anything about not parallelizing? the question didn't mention any requirement to inline. It did mention that there is a large volume of data to process. After giving kudos to the other answer I gave an alternate implementation and stated a couple of advantages. This type of pipeline is very common in the Linux world. There is nothing earth shattering here. – tdelaney Oct 02 '13 at 00:23