
I’m playing around with the Stack Overflow API using Python. I’m trying to decode the gzipped responses that the API gives.

import urllib, gzip

url = urllib.urlopen('http://api.stackoverflow.com/1.0/badges/name')
gzip.GzipFile(fileobj=url).read()

According to the urllib2 documentation, urlopen “returns a file-like object”.

However, when I run read() on the GzipFile object I’ve created using it, I get this error:

AttributeError: addinfourl instance has no attribute 'tell'

As far as I can tell, this is coming from the object returned by urlopen.

It doesn’t appear to have seek either, as I get an error when I do this:

url.read()
url.seek(0)

What exactly is this object, and how do I create a functioning GzipFile instance from it?

Paul D. Waite
  • `Content-Encoding: gzip` should be handled by the http library, but unfortunately it isn't. This is [issue 9500](http://bugs.python.org/issue9500) in Python's bug database, for the interested. – Magnus Hoff Nov 17 '10 at 14:09
  • @Magnus: cheers, good to know it’s at least in the bug tracker. – Paul D. Waite Nov 17 '10 at 14:28

3 Answers


The urlopen docs list the supported methods of the object that is returned. I recommend wrapping the object in another class that supports the methods that gzip expects.
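The wrapper approach might look like the following sketch. The class name `FileLikeWrapper` and the in-memory data are my own invention; in real use, the object returned by `urlopen` would take the place of the `BytesIO` stand-in. Older versions of `GzipFile` called `tell()` on the file object, so the wrapper tracks the read offset itself:

```python
import gzip
import io

class FileLikeWrapper:
    """Hypothetical wrapper: delegates read() to a raw stream and
    tracks the offset so tell() works, which is what GzipFile wanted."""
    def __init__(self, raw):
        self.raw = raw
        self.offset = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.offset += len(data)
        return data

    def tell(self):
        return self.offset

# Demonstration with an in-memory "response" instead of a live URL:
compressed = io.BytesIO()
with gzip.GzipFile(fileobj=compressed, mode='wb') as gz:
    gz.write(b'{"badges": []}')
compressed.seek(0)

wrapped = FileLikeWrapper(compressed)
data = gzip.GzipFile(fileobj=wrapped).read()
print(data)  # b'{"badges": []}'
```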

Other option: call the read method of the response object and put the result in a StringIO object (which should support all the methods that gzip expects). This may be a little more expensive, though.

E.g.

import gzip
import json
import StringIO
import urllib

url = urllib.urlopen('http://api.stackoverflow.com/1.0/badges/name')
url_f = StringIO.StringIO(url.read())
g = gzip.GzipFile(fileobj=url_f)
j = json.load(g)
hd1
stefanw
  • Wrapping it in a `StringIO` object gets past that error, but I still get an `IOError: Not a gzipped file` – Thomas K Nov 17 '10 at 13:16
  • 1
    @ThomasK It works find for me. Are you passing `url.read()` to the `StringIO` constructor or just `url`? The latter fails. – aaronasterling Nov 17 '10 at 13:21
  • Excellent, cheers. Unutbu’s answer was great too, but I’ll go with this one as I’m guessing the `StringIO` solution is more backwards compatible. – Paul D. Waite Nov 17 '10 at 14:49
  • 2
    Is there a way to do this without reading the entire `urlopen` response in one go? I'm looking to use something like this in a situation where the payload of the `urlopen` is very large (GBs), so I would like to be able to use this to stream-parse as data comes in, rather than blocking on the whole http request. – Kevin Oct 19 '15 at 15:21
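On the streaming question in the last comment: one way to avoid buffering the whole response is `zlib.decompressobj`, which can decompress gzip data chunk by chunk (passing `16 + zlib.MAX_WBITS` tells zlib to expect a gzip header). A minimal sketch, using an in-memory payload in place of the network stream:

```python
import gzip
import io
import zlib

# Build some gzipped data in memory to stand in for the HTTP response body.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'line one\nline two\n')
payload = buf.getvalue()

# 16 + zlib.MAX_WBITS makes zlib expect a gzip header and trailer.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
output = b''
for i in range(0, len(payload), 8):  # feed 8-byte chunks, as if from the network
    output += decomp.decompress(payload[i:i + 8])
output += decomp.flush()
print(output)  # b'line one\nline two\n'
```

In real use, the 8-byte chunks would be replaced by `response.read(chunk_size)` calls in a loop.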
import urllib2
import json
import gzip
import io

url = 'http://api.stackoverflow.com/1.0/badges/name'
page = urllib2.urlopen(url)
gzip_filehandle = gzip.GzipFile(fileobj=io.BytesIO(page.read()))
json_data = json.loads(gzip_filehandle.read())
print(json_data)

io.BytesIO is for Python 2.6+. For older versions of Python, you could use cStringIO.StringIO.

unutbu

Here is an update to @stefanw's answer, for those who might find it too expensive to hold the whole response in memory.

Thanks to this article (https://www.enricozini.org/blog/2011/cazzeggio/python-gzip/, which explains why gzip doesn't work with urllib response objects), the solution is to use Python 3:

import urllib.request
import gzip

response = urllib.request.urlopen('http://api.stackoverflow.com/1.0/badges/name')
with gzip.GzipFile(fileobj=response) as f:
    for line in f:
        print(line)
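This works because in Python 3, GzipFile in read mode only calls read() on the fileobj it is given, so the response object no longer needs tell() or seek(). A self-contained way to see this (ReadOnly is a hypothetical stand-in for the response object, exposing nothing but read()):

```python
import gzip
import io

class ReadOnly:
    """Hypothetical stand-in for the urlopen response: only read() exists."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)

# Gzip some sample data in memory.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode='wb') as gz:
    gz.write(b'hello\nworld\n')

# Python 3's GzipFile is happy with a read()-only object.
lines = list(gzip.GzipFile(fileobj=ReadOnly(raw.getvalue())))
print(lines)  # [b'hello\n', b'world\n']
```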
CKLu