I am trying to read a gzipped XML file that I fetch with requests. Everything I have read indicates that the decompression should happen automatically.
#!/usr/bin/python
from __future__ import unicode_literals
import requests

if __name__ == '__main__':
    url = 'http://rdf.dmoz.org/rdf/content.rdf.u8.gz'
    headers = {
        'Accept-Encoding': "gzip,x-gzip,deflate,sdch,compress",
        'Accept-Content': 'gzip',
        'HTTP-Connection': 'keep-alive',
        'Accept-Language': "en-US,en;q=0.8",
    }
    request_reply = requests.get(url, headers=headers)
    print request_reply.headers
    request_reply.encoding = 'utf-8'
    print request_reply.text[:200]
    print request_reply.content[:200]
The response headers, printed as the first line of output, look like this:
{'content-length': '260071268', 'accept-ranges': 'bytes', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Tue, 08 Sep 2015 16:27:49 GMT', 'content-type': 'application/x-gzip'}
The next two output lines appear to be binary, where I was expecting XML text:
�Iɒ(�����~ؗool���u�rʹ�J���io� a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO���}h�����6��·��>,aҚ>��hZ6�u��x���?y�_�.y�$�Բ
�Iɒ(�����~ؗool���u�rʹ�J���io� a2R1��ߞ|�<����_��������Ҽҿ=�Z����onnz7�{JO��}h�����6��·��>,aҚ>��hZ6�u��x���
I think part of the problem is that site-packages/requests/packages/urllib3/response.py does not recognize gzip unless the response headers include 'content-encoding': 'gzip'. Here the server sends only 'content-type': 'application/x-gzip'.
I was able to get the results I wanted by adding 4 lines to a method in response.py, like so:
def _init_decoder(self):
    """
    Set-up the _decoder attribute if necessary.
    """
    # Note: content-encoding value should be case-insensitive, per RFC 7230
    # Section 3.2
    content_encoding = self.headers.get('content-encoding', '').lower()
    if self._decoder is None and content_encoding in self.CONTENT_DECODERS:
        self._decoder = _get_decoder(content_encoding)
        # My added code below this comment
        return
    content_type = self.headers.get('content-type', '').lower()
    if self._decoder is None and content_type == 'application/x-gzip':
        self._decoder = _get_decoder('gzip')
But, is there a better way?
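
One possibility that avoids editing the installed library would be to treat the body as what the content-type says it is, a gzip file, and decompress it in the calling code. The following is only a minimal sketch under assumptions: it handles a single-member gzip stream, and iter_gunzip and first_bytes are hypothetical helper names, not part of requests or urllib3.

```python
import zlib


def iter_gunzip(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks.

    Assumption: the input is a single-member gzip stream.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        data = decoder.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered once the input is exhausted.
    tail = decoder.flush()
    if tail:
        yield tail


def first_bytes(chunks, limit):
    """Return at most `limit` decompressed bytes, stopping as soon as possible."""
    collected = b''
    for data in iter_gunzip(chunks):
        collected += data
        if len(collected) >= limit:
            break
    return collected[:limit]
```

With requests, the chunks could come from requests.get(url, stream=True).iter_content(chunk_size=65536), so that first_bytes(chunks, 200) reads the start of the XML without holding the whole body (about 260 MB, going by the content-length above) in memory.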