
I am trying to read a gzipped XML file that I fetch with requests. Everything I have read indicates that the decompression should happen automatically.

#!/usr/bin/python

from __future__ import unicode_literals
import requests

if __name__ == '__main__':

    url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'

    headers = {
        'Accept-Encoding': "gzip,x-gzip,deflate,sdch,compress",
        'Accept-Content': 'gzip',
        'HTTP-Connection': 'keep-alive',
        'Accept-Language': "en-US,en;q=0.8",
    }

    request_reply = requests.get(url, headers=headers)

    print request_reply.headers

    request_reply.encoding = 'utf-8'
    print request_reply.text[:200]
    print request_reply.content[:200]

The headers printed in my first line of output look like this:

{'content-length': '260071268', 'accept-ranges': 'bytes', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Tue, 08 Sep 2015 16:27:49 GMT', 'content-type': 'application/x-gzip'}

The next two output lines appear to be binary, where I was expecting XML text:

�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO���}h�����6��·��>,aҚ>��hZ6�u��x���?y�_�.y�$�Բ
�Iɒ(�����~ؗool���u�rʹ�J���io�   a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO��}h�����6��·��>,aҚ>��hZ6�u��x���

I think part of the problem is that site-packages/requests/packages/urllib3/response.py does not recognize gzip unless the response headers include 'content-encoding': 'gzip'.

I was able to get the results I wanted by adding 4 lines to a method in response.py like so:

    def _init_decoder(self):
        """
        Set-up the _decoder attribute if necessary.
        """
        # Note: content-encoding value should be case-insensitive, per RFC 7230
        # Section 3.2
        content_encoding = self.headers.get('content-encoding', '').lower()
        if self._decoder is None and content_encoding in self.CONTENT_DECODERS:
            self._decoder = _get_decoder(content_encoding)

            # My added code below this comment
            return
        content_type = self.headers.get('content-type', '').lower()
        if self._decoder is None and content_type == 'application/x-gzip':
            self._decoder = _get_decoder('gzip')

But is there a better way?

Martijn Pieters
user2367072

1 Answer


You misunderstood. Only transport-level compression is handled automatically, meaning compression that the HTTP server applies to the response body for the transfer.

Your content is itself a compressed file. Because that compression wasn't applied just for the HTTP transport stage, requests won't remove it.

requests communicates to the server that it accepts compressed responses by sending Accept-Encoding: gzip, deflate with every request sent. The server can then respond by compressing the whole response body and adding a Content-Encoding header indicating the compression used.

Your response has no Content-Encoding header, nor would applying compression again make sense here.
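To make the distinction concrete, here is a minimal sketch of a check you could run on a response yourself. It is not part of requests; the helper name needs_manual_gunzip is made up for illustration, and it assumes the headers mapping can be looked up with lower-case names (as r.headers can). It treats the body as a gzip file you must decompress yourself when no Content-Encoding was announced but the bytes start with the gzip magic number:

```python
import gzip
import io

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream (RFC 1952)


def needs_manual_gunzip(headers, body):
    """Return True if body looks like gzip data that HTTP did not announce.

    Hypothetical helper: `headers` is a mapping that answers lower-case
    header names, `body` is the raw bytes of the response.
    """
    content_encoding = headers.get('content-encoding', '').lower()
    if content_encoding in ('gzip', 'x-gzip'):
        # Transport-level compression: handled before you see the body.
        return False
    return body[:2] == GZIP_MAGIC


# Build a small in-memory gzip payload so the sketch runs without a network.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(b'<RDF/>')
payload = buf.getvalue()

# Headers shaped like the ones in the question: a gzip content type,
# but no Content-Encoding.
headers = {'content-type': 'application/x-gzip'}

if needs_manual_gunzip(headers, payload):
    xml = gzip.GzipFile(fileobj=io.BytesIO(payload)).read()
```

With a real response you would call it as needs_manual_gunzip(r.headers, r.content), although for a 260 MB file you would rather stream than load r.content into memory.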

Most of the time you want to download an already compressed archive like the DMOZ RDF dataset in the compressed form, anyway. You requested a compressed archive after all. It is not the job of the requests library to decode that.

In Python 3 you can handle decoding as a stream by using the gzip module and streaming the response:

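import gzip
import requests
import shutil

url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'  # URL from the question
path = 'content.rdf.u8'  # output file for the decompressed data; any name works

r = requests.get(url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True  # just in case transport encoding was applied
        gzip_file = gzip.GzipFile(fileobj=r.raw)
        shutil.copyfileobj(gzip_file, f)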

where you could use an RDF parser instead of copying the decompressed data to disk, of course.
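As a rough illustration of parsing instead of copying, the sketch below feeds the GzipFile to the standard library's xml.etree.ElementTree.iterparse rather than a dedicated RDF parser. The function name count_elements and the sample document are invented for the example; with a real response you would pass r.raw as fileobj. Note that in a namespaced document such as the DMOZ dump, ElementTree reports tags in the '{namespace}name' form, so the tag argument would need to match that.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def count_elements(fileobj, tag):
    """Count elements named `tag` in a gzip-compressed XML stream.

    Hypothetical helper: `fileobj` is any readable binary file object,
    for example r.raw from a streamed requests response.
    """
    count = 0
    gzip_file = gzip.GzipFile(fileobj=fileobj)
    # iterparse reads the file incrementally and reports each element
    # once its closing tag has been seen.
    for _event, elem in ET.iterparse(gzip_file, events=('end',)):
        if elem.tag == tag:
            count += 1
        elem.clear()  # discard children and text to keep memory use down
    return count


# In-memory stand-in for the downloaded archive.
sample = b'<RDF><Topic/><Topic/><ExternalPage/></RDF>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(sample)
buf.seek(0)

topic_count = count_elements(buf, 'Topic')
```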

Unfortunately, the Python 2 implementation of the gzip module requires a seekable file; you can either create your own streaming wrapper or set that _decoder attribute on the r.raw object above.
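One possible shape for such a streaming wrapper, offered as a sketch rather than the only way to do it, is to bypass the gzip module and use zlib.decompressobj directly, which needs no seeking and works on both Python 2 and 3. The generator name iter_gunzip is made up here. It assumes the archive is a single gzip member; data after the first member would be left undecoded.

```python
import gzip
import io
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes for an iterable of gzip-compressed chunks."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered once the input is exhausted.
    tail = decompressor.flush()
    if tail:
        yield tail


# In-memory stand-in for the network stream, split into small chunks.
original = b'<RDF>' + b'<Topic/>' * 100 + b'</RDF>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(original)
compressed = buf.getvalue()
chunks = [compressed[i:i + 7] for i in range(0, len(compressed), 7)]

result = b''.join(iter_gunzip(chunks))
```

With requests, the chunks could come from r.iter_content(chunk_size=64 * 1024) on a response fetched with stream=True.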

Martijn Pieters