
I'm using

 data = urllib2.urlopen(url).read()

I want to know:

  1. How can I tell if the data at a URL is gzipped?

  2. Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

mhlester
mlzboy
    Maybe worth noting that the [`requests` library](http://www.python-requests.org/) handles gzip compression automatically (see [the FAQ](http://www.python-requests.org/en/latest/community/faq/#encoded-data)) – dbr Aug 03 '13 at 09:39

4 Answers

  1. How can I tell if the data at a URL is gzipped?

This checks if the content is gzipped and decompresses it:

import urllib2
from StringIO import StringIO
import gzip

request = urllib2.Request('http://example.com/')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
if response.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
  2. Does urllib2 automatically uncompress the data if it is gzipped? Will the data always be a string?

No. urllib2 doesn't automatically uncompress the data, because the 'Accept-Encoding' header is not set by urllib2; you have to set it yourself with: request.add_header('Accept-Encoding', 'gzip, deflate')
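Since later comments ask about Python 3 (where urllib2 became urllib.request and StringIO.StringIO became io.BytesIO), here is a minimal sketch of the same check as a standalone helper, assuming you have already read the response body into bytes; decompress_if_gzipped is a hypothetical name, not part of any library:

```python
import gzip
import io

def decompress_if_gzipped(content_encoding, body):
    """Return the body bytes, gunzipping only when the server declared gzip."""
    if content_encoding == 'gzip':
        # Same technique as above: wrap the bytes in a file-like object
        # and let GzipFile decompress them.
        return gzip.GzipFile(fileobj=io.BytesIO(body)).read()
    return body
```

With urllib.request you would call it as `decompress_if_gzipped(response.headers.get('Content-Encoding'), response.read())`.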

Jay Taylor
ars
    bobince has a point, urllib2 would not be sending the appropriate headers, so the response will not be gzipped. – daniyalzade Jul 08 '11 at 19:48
  • @daysleeper: good point indeed, I'd forgotten to include the accept header. I've modified the code now. Thanks. – ars Jul 08 '11 at 20:04
  • In Py3k use io.BytesIO instead of StringIO.StringIO! – phobie Jul 30 '12 at 13:40
  • I'm setting the 'Accept-Encoding': 'gzip' header in my request, but my response doesn't seem to have any 'Content-Encoding' header set. The funny part is that I can see that compression is happening, since the data is smaller. Moreover, when I use Firebug in Firefox, the response seems to have 'Content-Encoding' set. I don't know what is happening. – VaidAbhishek Jul 08 '13 at 13:13
    Relevant: Why you can't stream urllib into gzip http://www.enricozini.org/2011/cazzeggio/python-gzip/ – Sam Jul 26 '13 at 23:01
  • What would this solution look like in Python3? – tommy.carstensen Apr 26 '15 at 20:01
  • @Sam: it is fixed e.g., [this code works](http://stackoverflow.com/a/26435241/4279). – jfs Jun 11 '15 at 11:41
    @tommy.carstensen: here's [Python 3 code example](http://stackoverflow.com/a/26435241/4279) – jfs Jun 11 '15 at 11:42
    @daniyalzade I'm working with a website that gzipped the response even though the request did not specify it. – Eyal May 12 '16 at 07:28
  • yep, steemit.com is one that does this. – jcomeau_ictx Sep 05 '16 at 19:48

If you are talking about a simple .gz file, no, urllib2 will not decode it, you will get the unchanged .gz file as output.

If you are talking about automatic HTTP-level compression using Content-Encoding: gzip or deflate, then that has to be deliberately requested by the client using an Accept-Encoding header.

urllib2 doesn't set this header, so the response it gets back will not be compressed. You can safely fetch the resource without having to worry about compression (though since compression isn't supported the request may take longer).
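For the first case (a raw .gz file that urllib2 hands back unchanged), you can also detect gzip data yourself: every gzip stream starts with the magic bytes 0x1f 0x8b (per RFC 1952). A minimal sketch, where is_gzipped and maybe_gunzip are hypothetical helper names; gzip.decompress requires Python 3.2+:

```python
import gzip

# Gzip streams always begin with these two magic bytes (RFC 1952).
GZIP_MAGIC = b'\x1f\x8b'

def is_gzipped(data):
    """Heuristically detect a gzip stream by its magic number."""
    return data[:2] == GZIP_MAGIC

def maybe_gunzip(data):
    """Decompress the payload only if it looks like gzip."""
    return gzip.decompress(data) if is_gzipped(data) else data
```

This is a heuristic on the bytes themselves, so it works even when no Content-Encoding header is present.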

bobince
    This doesn't seem to be true for all popular servers. Try `curl -vI http://en.wikipedia.org/wiki/Spanish_language |& grep '^[<>]'` – Andres Riofrio May 17 '13 at 09:41

Your question has been answered, but for a more comprehensive implementation, take a look at Mark Pilgrim's code for this: it covers gzip, deflate, safe URL parsing and much, much more. It was written for a widely used RSS parser, but it is a useful reference nevertheless.
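A comprehensive client has to handle both Content-Encoding values. A sketch of the usual approach, assuming the body has already been read into bytes (decode_body is a hypothetical helper name); the raw-deflate fallback is there because some servers send deflate data without the zlib wrapper:

```python
import gzip
import zlib

def decode_body(content_encoding, body):
    """Decode an HTTP body according to its Content-Encoding header."""
    if content_encoding == 'gzip':
        return gzip.decompress(body)
    if content_encoding == 'deflate':
        try:
            # Most servers send zlib-wrapped deflate data...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some send a raw deflate stream with no zlib header;
            # a negative wbits value tells zlib to expect that.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body
```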

RuiDC

It appears urllib3 handles this automatically now.

Reference headers:

HTTPHeaderDict({'ETag': '"112d13e-574c64196bcd9-gzip"', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'X-Frame-Options': 'sameorigin', 'Server': 'Apache', 'Last-Modified': 'Sat, 01 Sep 2018 02:42:16 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Content-Type': 'text/plain; charset=utf-8', 'Strict-Transport-Security': 'max-age=315360000; includeSubDomains', 'X-UA-Compatible': 'IE=edge', 'Date': 'Sat, 01 Sep 2018 14:20:16 GMT', 'Accept-Ranges': 'bytes', 'Transfer-Encoding': 'chunked'})

Reference code:

import gzip
import io
import urllib3

class EDDBMultiDataFetcher():
    def __init__(self):
        self.files_dict = {
            'Populated Systems':'http://eddb.io/archive/v5/systems_populated.jsonl',
            'Stations':'http://eddb.io/archive/v5/stations.jsonl',
            'Minor factions':'http://eddb.io/archive/v5/factions.jsonl',
            'Commodities':'http://eddb.io/archive/v5/commodities.json'
            }
        self.http = urllib3.PoolManager()

    def fetch_all(self):
        for item, url in self.files_dict.items():
            self.fetch(item, url)

    def fetch(self, item, url, save_file=None):
        print("Fetching: " + item)
        request = self.http.request(
            'GET',
            url,
            headers={
                'Accept-Encoding': 'gzip, deflate, sdch'
                })
        # urllib3 decodes the gzipped body automatically (decode_content
        # defaults to True), so request.data is already uncompressed here.
        data = request.data.decode('utf-8')
        print("Fetch complete")
        print(data)
        print(request.headers)
        quit()  # debugging: stops after the first file


if __name__ == '__main__':
    print("Fetching files from eddb.io")
    fetcher = EDDBMultiDataFetcher()
    fetcher.fetch_all()
RobotHumans