I tried parsing a web page using urllib.request's urlopen() method, like:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returned the result in bytes.

So I tried decoding it, like:

html = urlopen(req).read().decode("utf-8")

However, this raised an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

After some research, I found a related answer that parses the charset to decide how to decode. However, this page's response doesn't return a charset, and when I checked it in Chrome's Web Inspector, the following line appeared in the page's head:

<meta charset="utf-8">

So why can I not decode it with utf-8? And how can I parse the web page successfully?

The page URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the image from it to my disk.

I'm using Python 3.5.1. The same code shown above has worked fine in my other scraping programs.

Blaszard

1 Answer

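The content is compressed with gzip. The byte 0x8b at position 1 in the error message is the second byte of the gzip magic number (0x1f 0x8b), which is why it is not valid UTF-8. You need to decompress it before decoding: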

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
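
If the same scraper has to handle pages that may or may not be compressed, one option is to check for the gzip magic number before decompressing. This is only a rough sketch: the helper name decode_body and the GZIP_MAGIC constant are made up for illustration, and it assumes the body is either gzip or plain text in the given charset.

```python
import gzip

# Every gzip stream starts with these two bytes; the second one (0x8b)
# is the byte the UnicodeDecodeError complained about.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, charset="utf-8"):
    """Return the body as str, decompressing first if it looks like gzip.

    `decode_body` is a hypothetical helper, not part of urllib.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Offline demonstration with a locally compressed sample instead of a real page.
sample = gzip.compress('<meta charset="utf-8">'.encode("utf-8"))
print(sample[1] == 0x8b)    # True
print(decode_body(sample))  # <meta charset="utf-8">
```

With urllib you would pass it the raw bytes, for example decode_body(urlopen(req).read()). When the server sends one, the Content-Encoding response header is another way to tell whether the body is gzip-compressed.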

If you use requests, it will decompress the response automatically for you:

import requests
html = requests.get(url).text  # => str, not bytes
falsetru