
I need to routinely access and parse XML data from a website of the form:

https://api.website.com/stuff/getCurrentData?security_key=blah

I cannot post the actual connection details because of the secure nature of the data. When I put this URL into my browser (Safari), I get XML back.

When I call this through urllib2, I get junk.

f = urllib2.urlopen("https://api.website.com/stuff/getCurrentData?security_key=blah") 
s = f.read()
f.close()
s
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5\x96mo\xda0\x10\xc7\xdf\xf7SX\xbc\xda4\x15\xc7y\x00R\xb9\xae\xfa\xb4U\x1a-\x150M{5y\xe1\x06V\x13\x079\x0e\x14>\xfd\x9c\x84\xb0\xd2\xa4S\xa4L\xe5\x95\xef\xeeo 

This post, "Urllib's urlopen breaking on some sites (e.g. StackApps api): returns garbage results", seems to describe a similar problem, but it refers to JSON instead of XML. Following its instructions to look at the headers, I think I am getting gzip data back. (I ran the test it suggests; the results are posted below.)

req = urllib2.Request("https://api.website.com/stuff/getCurrentData?security_key=blah",
                      headers={'Accept-Encoding': 'gzip, identity'})
conn = urllib2.urlopen(req)
val = conn.read()
conn.close()
val[0:25]
'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5\x96]o\xda0\x14\x86\xef\xfb+,\xae6M'
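For reference, here is a quick offline way to confirm that those leading bytes are the gzip magic number (a minimal sketch; `gzip.compress` on a dummy payload stands in for the actual response body, which I cannot post):

```python
import gzip

body = gzip.compress(b"<response/>")  # stand-in for the bytes read from the API
print(body[:2] == b"\x1f\x8b")  # gzip streams always start with these two magic bytes
```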

In that post, there was some suggestion that this could be a local problem, so I tried an example site.

f = urllib2.urlopen("http://www.python.org")
s = f.read()
f.close()
s
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n  <meta http-equiv="content-type" content="text/html; charset=utf-8" />\n  <title>Python Programming Language &ndash; Official Website</title>\n  

This works just fine, so I think it has something to do with the site API that I am actually trying to access.

This post, "Why does text retrieved from pages sometimes look like gibberish?", suggested that I might need to use Selenium, but then the poster said the problem "fixed itself", which does not help me figure out what is wrong.

Am I unable to use Python to download secure data? Do I need to use something other than urllib2 and urlopen?

I am running Python 2.7 on Mac OS X 10.7.5.

– jessi
2 Answers

You are retrieving gzipped, compressed data; the server expressly tells you so with the Content-Encoding: gzip header. Either use the zlib library to decompress the data:

import zlib

# 16 + MAX_WBITS tells zlib to expect (and skip) the gzip header and trailer
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = decomp.decompress(val)

or use a library that supports transparent decompression if the response headers indicate compression has been used, like requests.
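Here is a self-contained round-trip of the above (`gzip.compress` on a dummy XML payload stands in for the raw bytes you read from the socket):

```python
import gzip
import zlib

xml = b"<response><value>ok</value></response>"
val = gzip.compress(xml)  # stands in for conn.read()

# 16 + MAX_WBITS makes zlib expect (and strip) the gzip header/trailer
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
data = decomp.decompress(val) + decomp.flush()
print(data == xml)  # the original XML is recovered
```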

– Martijn Pieters
  • EXCELLENT! The zlib works, but I'll go track down requests for perhaps a cleaner solution. I tried earlier to understand requests, but I was not entirely sure that would be a better approach than `urllib2` and `urlopen`. – jessi May 31 '13 at 14:27
  • @Jessi: `requests` is *far* cleaner than `urllib2`. :-) – Martijn Pieters May 31 '13 at 14:28

'\x1f\x8b' is indeed the magic number for gzip, so you are getting gzip data back.

In your second example you explicitly accept gzip encoded data, change that to 'Accept-Encoding': 'identity' and see if it makes a difference.
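For an offline check that the header actually gets set (a sketch using Python 3's urllib.request, the renamed urllib2; the pattern is the same with urllib2.Request, and the URL is the question's placeholder):

```python
from urllib.request import Request

url = "https://api.website.com/stuff/getCurrentData?security_key=blah"  # placeholder
req = Request(url, headers={"Accept-Encoding": "identity"})

# Request normalizes stored header names to "Xxxx-yyyy" capitalization
print(req.get_header("Accept-encoding"))
```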

– Lennart Regebro
  • When I get rid of gzip in the accept, @LennartRegebro, I get the same stuff. `req = urllib2.Request("https://api.website.us/stuff/getCurrentData?security_key=blah", headers={'Accept-Encoding': 'identity'}) conn = urllib2.urlopen(req) val = conn.read() conn.close() val[0:25] '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xc5\x96Qo\xda0\x10\xc7\xdf\xfb),\x9e6M'` – jessi May 31 '13 at 14:15
  • I thought they looked familiar @MartijnPieters. Here is what happens. `conn.info().headers ['Date: Fri, 31 May 2013 14:13:01 GMT\r\n', 'Server: Apache/2.2.14 (Ubuntu)\r\n', 'X-Powered-By: PHP/5.3.2-1ubuntu4.18\r\n', 'Content-Encoding: gzip\r\n', 'Content-Length: 645\r\n', 'Connection: close\r\n', 'Content-Type: text/xml\r\n']` So, it definitely looks like gzip – jessi May 31 '13 at 14:16
  • 1
    @Jessi: It's hard to say without testing, but it seems to me the server is broken and will always send gzip. You'll have to unzip it. http://stackoverflow.com/questions/3947120/does-python-urllib2-will-automaticly-uncompress-gzip-data-from-fetch-webpage (or even better, as Martijn suggested, use requests). – Lennart Regebro May 31 '13 at 14:17
  • Thanks for clarifying. I can use Martijn's suggestion. – jessi May 31 '13 at 14:28