Here is the situation:

  • I get gzipped XML documents from Amazon S3

      import boto
      from boto.s3.connection import S3Connection
      from boto.s3.key import Key
      conn = S3Connection('access Id', 'secret access key')
      b = conn.get_bucket('mydev.myorg')
      k = Key(b)
      k.key = 'documents/document.xml.gz'  # key is an attribute, not a callable
    
  • I read them into a file as

      import gzip

      f = open('/tmp/p', 'wb')  # binary mode; the payload is raw gzip bytes
      k.get_file(f)
      f.close()
      r = gzip.open('/tmp/p', 'rb')
      file_content = r.read()
      r.close()
    

Question

How can I ungzip the streams directly and read the contents?

I do not want to create temp files; writing to disk just to read the data back feels like an ugly workaround.

– daydreamer

4 Answers


Yes, you can use the zlib module to decompress byte streams:

import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    # flush any decompressed data still held in the decompressor's buffers
    rest = dec.flush()
    if rest:
        yield rest

Adding 32 to the window size argument (wbits) tells zlib to expect a gzip header and trailer and to skip them automatically.
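
As a quick self-contained check (a minimal sketch with a made-up payload), gzip-compressing some bytes and feeding them through such a decompressor round-trips cleanly:

import gzip
import zlib

raw = b'<doc>hello</doc>' * 100
gz = gzip.compress(raw)  # gzip-framed bytes, like the contents of a .gz file

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
assert dec.decompress(gz) + dec.flush() == raw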

The S3 key object is an iterator, so you can do:

for data in stream_gzip_decompress(k):
    # do something with the decompressed data
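
If you are on boto3 rather than the old boto shown in the question, the same generator works against the object's streaming body. A minimal sketch, reusing the question's bucket and key as placeholders:

import boto3

s3 = boto3.client('s3')
body = s3.get_object(Bucket='mydev.myorg', Key='documents/document.xml.gz')['Body']

# iter_chunks() yields raw bytes off the wire, so the file is never held in full
for data in stream_gzip_decompress(body.iter_chunks()):
    ...  # do something with the decompressed data
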
– Martijn Pieters

I had to do the same thing and this is how I did it:

import gzip
import StringIO  # Python 2; on Python 3 use io.BytesIO instead

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # this is crucial: rewind to the start before gzip reads the buffer
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
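
For completeness, a Python 3 sketch of the same trick, swapping StringIO for io.BytesIO (assuming the same boto Key k as in the question):

import gzip
import io

f = io.BytesIO()
k.get_file(f)  # boto writes the raw gzip bytes into the in-memory buffer
f.seek(0)      # rewind before handing the buffer to gzip
file_content = gzip.GzipFile(fileobj=f).read()
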
– Alex

For Python 3.x and boto3:

I used BytesIO to read the compressed file into a buffer object, then opened that buffer with zipfile as an archive, which let me read the data line by line.

import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')


def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    # read the whole compressed object into an in-memory buffer
    buffer = io.BytesIO(obj.get()["Body"].read())
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])  # open the first member of the archive
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()


if __name__ == '__main__':
    stream_zip_file()
– Shek
  • I noticed that memory consumption increases significantly with `buffer = io.BytesIO(obj.get()["Body"].read())`, whereas reading a portion at a time with `read(1024)` keeps memory usage low! – user 923227 Mar 19 '18 at 21:52
  • `buffer = io.BytesIO(obj.get()["Body"].read())` reads the whole file into memory. – Kirk Broadhurst May 11 '18 at 18:39
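
Following up on those comments: for a plain .gz object (as in the original question, rather than a .zip archive), you can avoid buffering entirely by wrapping the streaming body itself. A minimal sketch, assuming an obj like the one above but pointing at a gzip key:

import gzip

body = obj.get()["Body"]  # botocore StreamingBody exposes a file-like read()
with gzip.GzipFile(fileobj=body) as gz:
    for line in gz:  # decompresses incrementally, line by line
        pass  # process each decompressed line here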

You can use a pipe and read the contents without writing a temp file:

    import subprocess

    # zcat writes the decompressed stream to stdout; read it through a pipe
    c = subprocess.Popen(['zcat', '-c', '<gzip file name>'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for row in c.stdout:
        print(row)

In addition "/dev/fd/" + str(c.stdout.fileno()) will provide you FIFO file name (Named pipe) which can be passed to other program.