Python3 urlopen read weirdness (gzip)

Question

I'm fetching a URL from Schema.org. Its content type is "text/html".

Sometimes, read() functions as expected: b'<!DOCTYPE html> ....'

Sometimes, read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'

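from urllib.error import URLError
from urllib.request import urlopen

try:
    with urlopen("http://schema.org/docs/releases.html") as f:
        txt = f.read()
except URLError:
    return  # this snippet runs inside a function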

I've tried solving this with txt = f.read().decode("utf-8").encode() but this results in an error... sometimes: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The obvious work-around is to test whether the first byte is 0x1f (the start of the gzip magic number) and treat the data accordingly.

My question is: Is this a bug or something else?

Edit: Related question. Apparently, I'm sometimes getting a gzipped stream.

In the end I solved this by adding the following code, as proposed here:

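from zlib import MAX_WBITS, decompress
if 31 == txt[0]:  # 31 == 0x1f, the first byte of the gzip magic number
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    txt = decompress(txt, 16 + MAX_WBITS)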
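For reference, here is a minimal sketch of the same work-around as a reusable function. The name `maybe_gunzip` is hypothetical, and checking both magic bytes (0x1f 0x8b) rather than only the first is an assumption made here to reduce false positives; it is not part of the original fix.

```python
import gzip
import zlib

# gzip streams start with these two bytes
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(data):
    """Return data decompressed if it looks like gzip, else unchanged.

    Hypothetical helper: sniffs the two-byte gzip magic number.
    """
    if data[:2] == GZIP_MAGIC:
        # 16 + MAX_WBITS makes zlib expect a gzip header and trailer
        return zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data


if __name__ == "__main__":
    html = b"<!DOCTYPE html>"
    print(maybe_gunzip(gzip.compress(html)) == html)
    print(maybe_gunzip(html) == html)
```

Sniffing the payload works without looking at response headers, but it would also trigger on any non-gzip body that happens to start with those two bytes.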

The question remains: why does this return plain text/html sometimes and a gzipped stream at other times?

I see your edit now. When you receive a zipped stream, you'll obviously have to unzip it first. You can probably avoid getting a zipped response by adding an "Accept" header. — Jaap Versteegh, Aug 25 '15 at 11:31
@SlashV Cool, I'll do that. I've seen a number of questions related to zipped streams on StackOverflow. Should I delete this Q? — GUI Junkie, Aug 25 '15 at 11:33
@SlashV Shouldn't that rather be [`Accept-Encoding`](https://tools.ietf.org/html/rfc2616#page-102)? — dhke, Aug 25 '15 at 11:41
@dhke yup. In particular I feel that `Accept-Encoding: identity` should help — Jaap Versteegh, Aug 25 '15 at 12:00
@SlashV Yeah. `urlopen()` doesn't specify `Accept-Encoding` at all, which the server previously [MAY](https://tools.ietf.org/html/rfc2616#section-14.3) interpret as `Accept-Encoding: *`. This has changed with [RFC7231](https://tools.ietf.org/html/rfc7231#section-5.3.4). From the question history, I'd really consider this a wiki-answer case. — dhke, Aug 25 '15 at 12:06
I cannot get gzipped data from this URL **even if** I specify `Accept-Encoding: gzip;q=1` or `Accept-Encoding: gzip, deflate`, so it seems to me this server has some rules of its own. — Jaap Versteegh, Aug 25 '15 at 15:05

Jaap Versteegh · Answer 1 · 2015-08-25T15:13:17.693

2

You are indeed receiving a gzipped response. You should be able to avoid it by explicitly requesting the "identity" (uncompressed) content-coding:

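from urllib import request
from urllib.error import URLError

try:
    req = request.Request("http://schema.org/docs/releases.html")
    # Ask the server for an uncompressed ("identity") response.
    req.add_header('Accept-Encoding', 'identity;q=1')
    with request.urlopen(req) as f:
        txt = f.read()
except URLError:
    return  # this snippet runs inside a function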

edited Aug 25 '15 at 15:13

answered Aug 25 '15 at 12:13

Jaap Versteegh


The work-around I've chosen is to decompress it... less code – GUI Junkie Aug 25 '15 at 14:02
@GUIJunkie nah, both are two lines **and** you'll have to do an extra import for the `decompress` ;) – Jaap Versteegh Aug 25 '15 at 15:10
:-) I also had to do an extra import for the Request. Shrugs. – GUI Junkie Aug 25 '15 at 15:22

score 2 · Accepted Answer · edited Oct 07 '21 at 11:04

There are other questions in this category, but I cannot find an answer that addresses the actual cause of the problem.

Python's urlopen() (urllib.request.urlopen() in Python 3, urllib2.urlopen() in Python 2) cannot transparently handle compression. It also by default does not set the Accept-Encoding request header. Additionally, the HTTP standard's interpretation of this situation has changed over time.

As per RFC2616:

If no Accept-Encoding field is present in a request, the server MAY assume that the client will accept any content coding. In this case, if "identity" is one of the available content-codings, then the server SHOULD use the "identity" content-coding, unless it has additional information that a different content-coding is meaningful to the client.

Unfortunately for this use case, RFC7231 changes this to:

If no Accept-Encoding field is in the request, any content-coding is considered acceptable by the user agent.

This means that when performing a request using urlopen(), you can get a response in whatever content-coding the server decides to use, and the response will still be conformant.
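To make the rule concrete, here is a simplified sketch of how a server might decide whether a content-coding is acceptable under RFC 7231 section 5.3.4. The function name `is_acceptable` and its parsing logic are illustrative assumptions, not code from any real server, and it ignores edge cases such as malformed q-values.

```python
def is_acceptable(coding, accept_encoding):
    """Simplified, hypothetical reading of RFC 7231 section 5.3.4.

    accept_encoding is the header value, or None if the header is absent.
    """
    if accept_encoding is None:
        # No Accept-Encoding field: any content-coding is acceptable.
        return True

    prefs = {}
    for item in accept_encoding.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        params = params.strip()
        q = 1.0  # default weight when no q-value is given
        if params.lower().startswith("q="):
            q = float(params[2:])
        prefs[name.strip().lower()] = q

    coding = coding.lower()
    if coding in prefs:
        return prefs[coding] > 0
    if "*" in prefs:
        return prefs["*"] > 0
    # Unlisted codings are not acceptable, except "identity",
    # which stays acceptable unless explicitly excluded.
    return coding == "identity"


if __name__ == "__main__":
    print(is_acceptable("gzip", None))
    print(is_acceptable("gzip", "identity;q=1"))
```

Under this reading, a request without the header leaves gzip acceptable, while `identity;q=1` does not list gzip and so rules it out.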

schema.org seems to be hosted by Google, i.e. it is most likely behind a distributed frontend load balancer network. So the different answers you get might be returned from load balancers with slightly different configurations.

Google engineers have in the past advocated for the use of HTTP compression, so this might as well be a conscious decision.

So as a lesson: when using urlopen() we need to set Accept-Encoding.
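A minimal sketch that combines both approaches from this page: request the identity coding, but still decompress if the server answers with gzip anyway. The helper names `decode_body` and `fetch`, and the 10-second timeout, are assumptions for illustration; only gzip is handled here.

```python
import gzip
from urllib.error import URLError
from urllib.request import Request, urlopen


def decode_body(raw, content_encoding):
    """Undo gzip if the Content-Encoding header says so (hypothetical helper)."""
    coding = (content_encoding or "identity").strip().lower()
    if coding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    return raw


def fetch(url, timeout=10):
    """Fetch url as bytes, or return None if the request fails."""
    # Ask for an uncompressed response, as suggested above.
    req = Request(url, headers={"Accept-Encoding": "identity"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            # Some servers may compress anyway, so check the response header.
            return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
    except URLError:
        return None


if __name__ == "__main__":
    body = fetch("http://schema.org/docs/releases.html")
    print(body[:15] if body else "request failed")
```

Checking the Content-Encoding response header is more direct than sniffing the first bytes of the body, though it relies on the server labeling its response correctly.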

I suspected something like that. A load balancer seems the reasonable explanation. Cheers. — GUI Junkie, Aug 25 '15 at 14:00
