urllib.request.urlopen returns bytes, but I cannot decode them

Question

I tried parsing a web page using urllib.request's urlopen() method, like:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried decoding it, like:

html = urlopen(req).read().decode("utf-8")

However, this raised an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

After some research, I found a related answer that reads the charset from the response to decide how to decode it. However, this page's response doesn't include a charset, and when I checked the page in Chrome Web Inspector, the following line was in its head:

<meta charset="utf-8">

So why can I not decode it with utf-8? And how can I parse the web page successfully?

The page URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the image from it to my disk.

Note that I use Python 3.5.1. The same approach has worked well in my other scraping programs.

score 14 · Accepted Answer · answered Feb 01 '16 at 02:45

The content is compressed with gzip. You need to decompress it:

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
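Not every response is gzipped, so calling gzip.decompress unconditionally can fail on other pages. A minimal sketch that checks the Content-Encoding header first is shown below; the helper names decode_body and fetch_html are made up for illustration, and falling back to UTF-8 when no charset is declared is an assumption.

```python
import gzip
from urllib.request import Request, urlopen


def decode_body(raw, content_encoding=None, charset="utf-8"):
    """Decompress raw bytes if they are gzip-encoded, then decode to str.

    Hypothetical helper: only gzip is handled here; other encodings
    (deflate, br) would need their own branches.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def fetch_html(url):
    """Fetch a page and return its text (hypothetical wrapper around urlopen)."""
    with urlopen(Request(url)) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding, charset)
```

With this, html = fetch_html(url) should give a str whether or not the server compressed the body.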

If you use requests, it will decompress the response automatically for you:

import requests
html = requests.get(url).text  # => str, not bytes

falsetru

Thanks. Can you share how you can get to know it is gzip? – Blaszard Feb 01 '16 at 02:53
@Blaszard the duplicate has that – ivan_pozdeev Feb 01 '16 at 02:55
@Blaszard, `urlopen(req).info()['content-encoding']` – falsetru Feb 01 '16 at 03:38
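Besides the header, the error message itself is a hint: gzip data starts with the two magic bytes 0x1f 0x8b, and the traceback complains about byte 0x8b at position 1. A rough sketch of detecting this from the body alone follows; looks_gzipped and maybe_decompress are illustrative names, not part of any library.

```python
import gzip

# Every gzip stream begins with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(raw):
    """Return True if the bytes start with the gzip magic number."""
    return raw[:2] == GZIP_MAGIC


def maybe_decompress(raw):
    """Decompress only when the body looks like gzip; otherwise pass it through."""
    return gzip.decompress(raw) if looks_gzipped(raw) else raw
```

This can serve as a fallback when the Content-Encoding header is missing, for example maybe_decompress(urlopen(req).read()).decode('utf-8').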
