I'm fetching a URL from Schema.org. Its Content-Type is "text/html".
Sometimes read() works as expected and returns b'<!DOCTYPE html> ...'
Sometimes read() returns something else: b'\x1f\x8b\x08\x00\x00\x00\x00 ...'
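    from urllib.error import URLError
    from urllib.request import urlopen

    # This snippet runs inside a function, hence the bare return.
    try:
        with urlopen("http://schema.org/docs/releases.html") as f:
            txt = f.read()
    except URLError:
        return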
I've tried solving this with txt = f.read().decode("utf-8").encode()
but this sometimes fails with: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The obvious workaround is to test whether the first byte is \x1f and handle that case separately.
My question is: Is this a bug or something else?
Edit
Related question. Apparently, sometimes I'm getting a gzipped stream.
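To illustrate why the decode() call fails on such a response, here is a small offline sketch. It does not contact schema.org; the html bytes are a made-up stand-in for the real page, compressed locally with the standard gzip module.

```python
import gzip

# Stand-in for the real page body (an assumption, not the actual schema.org content).
html = b"<!DOCTYPE html><html></html>"
compressed = gzip.compress(html)

# A gzip stream starts with the magic bytes 0x1f 0x8b, then 0x08 for deflate.
print(compressed[:3])  # b'\x1f\x8b\x08'

# 0x1f is valid ASCII, but 0x8b cannot start a UTF-8 sequence,
# so decoding fails at position 1.
error = None
try:
    compressed.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decompressing first recovers the original bytes.
restored = gzip.decompress(compressed)
```

This matches the bytes I see in the failing case, and the error position matches the traceback above.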
In the end I solved this by adding the following code, as proposed here:
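    from zlib import decompress, MAX_WBITS

    if 31 == txt[0]:  # 31 == 0x1f, the first byte of the gzip magic number
        txt = decompress(txt, 16 + MAX_WBITS)  # 16 + MAX_WBITS tells zlib to expect a gzip header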
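An alternative to sniffing the first byte might be to look at the response's Content-Encoding header. This is only a sketch: it assumes the server labels compressed responses with Content-Encoding: gzip, which I have not verified for this URL, and decode_body and fetch are hypothetical helper names.

```python
import gzip
from urllib.error import URLError
from urllib.request import urlopen


def decode_body(raw, content_encoding):
    """Return the body as uncompressed bytes, based on the Content-Encoding value."""
    if (content_encoding or "").strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    return raw


def fetch(url):
    """Hypothetical helper: fetch url and undo gzip if the header says it was applied."""
    try:
        with urlopen(url) as f:
            # f.headers is the response's header mapping; get() returns None if absent.
            return decode_body(f.read(), f.headers.get("Content-Encoding"))
    except URLError:
        return None
```

If the header turns out to be missing on the compressed responses, the first-byte check above would still be needed as a fallback.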
The question remains: why does this return plain text/html sometimes and gzipped data at other times?