Here is the situation:

  • I get gzipped XML documents from Amazon S3

      import boto
      from boto.s3.connection import S3Connection
      from boto.s3.key import Key
      conn = S3Connection('access Id', 'secret access key')
      b = conn.get_bucket('mydev.myorg')
      k = Key(b)
      k.key = 'documents/document.xml.gz'  # key is an attribute, not a callable
    
  • I read them into a file as

      import gzip

      f = open('/tmp/p', 'wb')  # binary mode; the payload is raw gzip bytes
      k.get_file(f)
      f.close()
      r = gzip.open('/tmp/p', 'rb')
      file_content = r.read()
      r.close()
    

Question

How can I ungzip the streams directly and read the contents?

I do not want to create temp files; writing to disk just to read the data back feels like an ugly workaround.

– daydreamer

4 Answers


Yes, you can use the zlib module to decompress byte streams:

import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    # flush any decompressed data still held in the decompressor's buffers
    rest = dec.flush()
    if rest:
        yield rest

Adding 32 to the window size argument (wbits) tells zlib to expect a gzip header and trailer and to skip them automatically.
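
As a quick self-contained check (a minimal sketch with a made-up payload), gzip-compressing some bytes and feeding them through such a decompressor round-trips cleanly:

import gzip
import zlib

raw = b'<doc>hello</doc>' * 100
gz = gzip.compress(raw)  # gzip-framed bytes, like the contents of a .gz file

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
assert dec.decompress(gz) + dec.flush() == raw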

The S3 key object is an iterator, so you can do:

for data in stream_gzip_decompress(k):
    # do something with the decompressed data
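
If you are on boto3 rather than the old boto shown in the question, the same generator works against the object's streaming body. A minimal sketch, reusing the question's bucket and key as placeholders:

import boto3

s3 = boto3.client('s3')
body = s3.get_object(Bucket='mydev.myorg', Key='documents/document.xml.gz')['Body']

# iter_chunks() yields raw bytes off the wire, so the file is never held in full
for data in stream_gzip_decompress(body.iter_chunks()):
    ...  # do something with the decompressed data
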
– Martijn Pieters

I had to do the same thing and this is how I did it:

import gzip
import StringIO  # Python 2; on Python 3 use io.BytesIO instead

f = StringIO.StringIO()
k.get_file(f)
f.seek(0)  # this is crucial: rewind to the start before gzip reads the buffer
gzf = gzip.GzipFile(fileobj=f)
file_content = gzf.read()
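
For completeness, a Python 3 sketch of the same trick, swapping StringIO for io.BytesIO (assuming the same boto Key k as in the question):

import gzip
import io

f = io.BytesIO()
k.get_file(f)  # boto writes the raw gzip bytes into the in-memory buffer
f.seek(0)      # rewind before handing the buffer to gzip
file_content = gzip.GzipFile(fileobj=f).read()
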
– Alex

For Python 3.x and boto3:

I used BytesIO to read the compressed file into a buffer object, then opened that buffer with zipfile as an archive, which let me read the data line by line.

import io
import zipfile
import boto3
import sys

s3 = boto3.resource('s3', 'us-east-1')


def stream_zip_file():
    obj = s3.Object(
        bucket_name='MonkeyBusiness',
        key='/Daily/Business/Banana/{current-date}/banana.zip'
    )
    # read the whole compressed object into an in-memory buffer
    buffer = io.BytesIO(obj.get()["Body"].read())
    z = zipfile.ZipFile(buffer)
    foo2 = z.open(z.infolist()[0])  # open the first member of the archive
    print(sys.getsizeof(foo2))
    line_counter = 0
    for _ in foo2:
        line_counter += 1
    print(line_counter)
    z.close()


if __name__ == '__main__':
    stream_zip_file()
– Shek
  • I noticed that memory consumption increases significantly with `buffer = io.BytesIO(obj.get()["Body"].read())`, whereas reading a portion at a time with `read(1024)` keeps memory usage low! – user 923227 Mar 19 '18 at 21:52
  • `buffer = io.BytesIO(obj.get()["Body"].read())` reads the whole file into memory. – Kirk Broadhurst May 11 '18 at 18:39
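
Following up on those comments: for a plain .gz object (as in the original question, rather than a .zip archive), you can avoid buffering entirely by wrapping the streaming body itself. A minimal sketch, assuming an obj like the one above but pointing at a gzip key:

import gzip

body = obj.get()["Body"]  # botocore StreamingBody exposes a file-like read()
with gzip.GzipFile(fileobj=body) as gz:
    for line in gz:  # decompresses incrementally, line by line
        pass  # process each decompressed line here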

You can use a pipe and read the contents without writing a temp file:

    import subprocess

    # zcat writes the decompressed stream to stdout; read it through a pipe
    c = subprocess.Popen(['zcat', '-c', '<gzip file name>'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for row in c.stdout:
        print(row)

In addition "/dev/fd/" + str(c.stdout.fileno()) will provide you FIFO file name (Named pipe) which can be passed to other program.