I'm trying to get the HTML of a webpage.

data = response.read()

gives me something like this:

b'\x1f\x8b\x08\x00\x00\x00\x00\...

How can I convert those bytes into a readable string, something like "<html><body>.."?

Cœur
user1641071

1 Answer

You are dealing with a gzipped response. You can verify this by checking the Content-Encoding response header or, on a Unix-like platform, by writing the beginning of that byte sequence to a file and checking its type with the file utility:

>>> data = b'\x1f\x8b\x08\x00\x00\x00\x00'
>>> f = open('data.bin', 'wb')  # binary mode, so the bytes are written unchanged
>>> f.write(data)
7
>>> f.close()
$ file data.bin
data.bin: gzip compressed data, last modified: Thu Jun 16 09:32:16 1994
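
If the file utility is not available, the same check can be done from Python by looking at the first two bytes, which for gzip data are always 0x1f 0x8b. This is only a minimal sketch; the helper name looks_gzipped is made up for illustration and is not part of any library:

```python
# The gzip format starts with the two-byte magic number 0x1f 0x8b.
GZIP_MAGIC = b'\x1f\x8b'


def looks_gzipped(data: bytes) -> bool:
    """Return True if the byte string starts with the gzip magic number.

    Hypothetical helper for illustration; it only inspects the header and
    does not prove that the rest of the data is a valid gzip stream.
    """
    return data[:2] == GZIP_MAGIC


if __name__ == '__main__':
    print(looks_gzipped(b'\x1f\x8b\x08\x00\x00\x00\x00'))
    print(looks_gzipped(b'<html><body>..'))
```

A matching prefix is a strong hint rather than a guarantee, so checking the Content-Encoding header as well is still worthwhile.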

You could decompress it yourself, but I suggest ditching urllib for the requests module, which decompresses gzipped responses automatically:

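import requests

url = 'http://example.com/'  # placeholder; substitute the page you are fetching
response = requests.get(url)
# .text is the decompressed body decoded to str; .content gives the raw bytes
print(response.text)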
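
For anyone who prefers to stay with urllib, decompressing by hand might look roughly like the sketch below. It uses the standard library's gzip.decompress; the helper names maybe_gunzip and to_html are hypothetical, and UTF-8 is only an assumed default, since the real charset should come from the Content-Type header:

```python
import gzip


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress the data if it starts with the gzip magic number.

    Hypothetical helper: data that does not look gzipped is returned as-is.
    """
    if data[:2] == b'\x1f\x8b':
        return gzip.decompress(data)
    return data


def to_html(data: bytes, encoding: str = 'utf-8') -> str:
    """Turn a raw (possibly gzipped) response body into a str.

    The UTF-8 default is an assumption; pass the page's real charset.
    """
    return maybe_gunzip(data).decode(encoding)


if __name__ == '__main__':
    # Simulate a gzipped body such as the one returned by response.read().
    body = gzip.compress(b'<html><body>..</body></html>')
    print(to_html(body))
```

With a real request, the bytes from response.read() would be passed to to_html in place of the simulated body.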
Community
Lukas Graf
  • I think it should be `open('data.bin', 'wb')` so it accepts binary data. Or am I missing something? –  Jul 08 '14 at 23:28
  • @bvukelic that's true on Windows, I always forget about that... odd "feature". It doesn't make any difference though for any other platform. And since the [`file`](http://linux.die.net/man/1/file) utility doesn't exist on Windows (to my knowledge), it doesn't really make a difference in this case. – Lukas Graf Jul 08 '14 at 23:40
  • That's interesting, I didn't know about that 'feature'. I just assumed it would be the same on all platforms since it's implied in the docs. I never thought of trying it without 'b' on Linux. https://docs.python.org/3.3/tutorial/inputoutput.html#reading-and-writing-files –  Jul 08 '14 at 23:44
  • As far as I understand it's like this: There's no such thing as "binary data". There's just data. But for some reason, if you don't specify the binary mode, the Windows implementation of Python thinks it would be a hilarious idea to mess with newlines (feeding you different data than what's actually in the file, or writing something different from what you got in memory). – Lukas Graf Jul 08 '14 at 23:47