
I'm using

 data = urllib2.urlopen(url).read()

I want to know:

  1. How can I tell if the data at a URL is gzipped?

  2. Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

mhlester
mlzboy
    Maybe worth noting that the [`requests` library](http://www.python-requests.org/) handles gzip compression automatically (see [the FAQ](http://www.python-requests.org/en/latest/community/faq/#encoded-data)) – dbr Aug 03 '13 at 09:39

4 Answers

  1. How can I tell if the data at a URL is gzipped?

This checks if the content is gzipped and decompresses it:

import urllib2
from StringIO import StringIO
import gzip

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
  2. Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

No. urllib2 doesn't automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2; you have to set it yourself with: request.add_header('Accept-Encoding', 'gzip, deflate')
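Since later comments ask about Python 3 (where urllib2 became urllib.request and StringIO.StringIO became io.BytesIO), here is a minimal sketch of the same check as a standalone helper, assuming you have already read the response body into bytes; decompress_if_gzipped is a hypothetical name, not part of any library:

```python
import gzip
import io

def decompress_if_gzipped(content_encoding, body):
    """Return the body bytes, gunzipping only when the server declared gzip."""
    if content_encoding == 'gzip':
        # Same technique as above: wrap the bytes in a file-like object
        # and let GzipFile decompress them.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With urllib.request you would call it as `decompress_if_gzipped(response.headers.get('Content-Encoding'), response.read())`.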

Jay Taylor
ars
    bobince has a point, urllib2 would not be sending the appropriate headers, so the response will not be gzipped. – daniyalzade Jul 08 '11 at 19:48
  • @daysleeper: good point indeed, I'd forgotten to include the accept header. I've modified the code now. Thanks. – ars Jul 08 '11 at 20:04
  • In Py3k use io.BytesIO instead of StringIO.StringIO! – phobie Jul 30 '12 at 13:40
  • I'm setting the 'Accept-Encoding': 'gzip' header in my request, but my response doesn't seem to have any 'Content-Encoding' header set. The funny part is that I can see that compression is happening, since the data is smaller. Moreover, when I use Firebug in Firefox, the response seems to have 'Content-Encoding' set. I don't know what is happening. – VaidAbhishek Jul 08 '13 at 13:13
    Relevant: Why you can't stream urllib into gzip http://www.enricozini.org/2011/cazzeggio/python-gzip/ – Sam Jul 26 '13 at 23:01
  • What would this solution look like in Python3? – tommy.carstensen Apr 26 '15 at 20:01
  • @Sam: it is fixed e.g., [this code works](http://stackoverflow.com/a/26435241/4279). – jfs Jun 11 '15 at 11:41
    @tommy.carstensen: here's [Python 3 code example](http://stackoverflow.com/a/26435241/4279) – jfs Jun 11 '15 at 11:42
    @daniyalzade I'm working with a website that gzipped the response even though the request did not specify it. – Eyal May 12 '16 at 07:28
  • yep, steemit.com is one that does this. – jcomeau_ictx Sep 05 '16 at 19:48

If you are talking about a simple .gz file, no, urllib2 will not decode it, you will get the unchanged .gz file as output.

If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.

urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported the request may take longer).
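For the first case (a raw .gz file that urllib2 hands back unchanged), you can also detect gzip data yourself: every gzip stream starts with the magic bytes 0x1f 0x8b (per RFC 1952). A minimal sketch, where is_gzipped and maybe_gunzip are hypothetical helper names; gzip.decompress requires Python 3.2+:

```python
import gzip

# Gzip streams always begin with these two magic bytes (RFC 1952).
GZIP_MAGIC = b'\x1f\x8b'

def is_gzipped(data):
    """Heuristically detect a gzip stream by its magic number."""
    return data[:2] == GZIP_MAGIC

def maybe_gunzip(data):
    """Decompress the payload only if it looks like gzip."""
    return gzip.decompress(data) if is_gzipped(data) else data
```

This is a heuristic on the bytes themselves, so it works even when no Content-Encoding header is present.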

bobince
    This doesn't seem to be true for all popular servers. Try `curl -vI http://en.wikipedia.org/wiki/Spanish_language |& grep '^[<>]'` – Andres Riofrio May 17 '13 at 09:41

Your question has been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's code for this: it covers gzip, deflate, safe URL parsing and much, much more. It was written for a widely used RSS parser, but it is a useful reference nevertheless.
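A comprehensive client has to handle both Content-Encoding values. A sketch of the usual approach, assuming the body has already been read into bytes (decode_body is a hypothetical helper name); the raw-deflate fallback is there because some servers send deflate data without the zlib wrapper:

```python
import gzip
import zlib

def decode_body(content_encoding, body):
    """Decode an HTTP body according to its Content-Encoding header."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        try:
            # Most servers send zlib-wrapped deflate data...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some send a raw deflate stream with no zlib header;
            # a negative wbits value tells zlib to expect that.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body
```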

RuiDC

It appears urllib3 handles this automatically now.

Reference headers:

HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018 02:42:16 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain; charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep 2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked'})

Reference code:

import gzip
import io
import urllib3

class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems':'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations':'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions':'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities':'http://eddb.io/archive/v5/commodities.json'
            }
        self.http = urllib3.PoolManager()

    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file=None):
        print("Fetching: " + item)
        request = self.http.request(
            'GET',
            url,
            headers={
                'Accept-Encoding': 'gzip, deflate, sdch'
                })
        # urllib3 decodes the gzipped body automatically (decode_content
        # defaults to True), so request.data is already uncompressed here.
        data = request.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(request.headers)
        quit()  # debugging: stops after the first file


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()
RobotHumans