UnicodeDecodeError downloading HTML using Python

Question

I've just started learning Python. While writing a tool to download the online book "Learn Vimscript the Hard Way", I ran into a problem.

This is my code (Python 3.5):

#coding: utf-8
import urllib.request
import re

url = 'http://learnvimscriptthehardway.stevelosh.com'
name = '/chapters/16.html'
while(len(name) != 0):
    url1 = url + name 
    print(url1)
    response = urllib.request.urlopen(url1)
    vim = response.read().decode('utf-8')
    address = "/Users/zhangzhimin/learnvimthehardway/" + name[-2:] + ".html"
    with open(address, "w") as f:
        f.write(vim)
    print("%s finish" % name)
    x = re.findall('''<a class="next" href="(.+?)"''', vim)
    name = x[0]

This is the result:

:!python3 test.py
http://learnvimscriptthehardway.stevelosh.com/chapters/16.html
/chapters/16.html finish
http://learnvimscriptthehardway.stevelosh.com/chapters/17.html
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    vim = response.read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I don't know why this happens: I can download chapter 16 and decode it but I can't do the same thing for chapter 17.

byte 0x8b in position 1 usually signals that the data stream is gzipped. Take a look [here](http://stackoverflow.com/a/13483961) — algor, Jan 20 '16 at 17:51
why do you decode? open the file to write as bytes and just write the bytes you got. oh, I see, you parse the file later.. — Ale, Jan 20 '16 at 19:35
Possible duplicate of [urllib2 opener providing wrong charset](http://stackoverflow.com/questions/9445627/urllib2-opener-providing-wrong-charset) — ivan_pozdeev, Jan 21 '16 at 12:39
Consider using `requests`: the library transparently decodes `transfer-encoding`. And yes, [an HTML parser to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — ivan_pozdeev, Jan 21 '16 at 12:46
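
The last comment suggests an HTML parser instead of a regular expression for finding the "next" link. A minimal sketch of that idea with the standard library's `html.parser` is below. The names `NextLinkParser` and `find_next` and the sample markup are made up for illustration; the real page structure is only assumed to match the `<a class="next" href="...">` pattern used in the question.

```python
from html.parser import HTMLParser


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> tag whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything once a link has been found.
        if tag != "a" or self.next_href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "next" in classes:
            self.next_href = attr_map.get("href")


def find_next(html_text):
    """Return the href of the 'next' link, or None if there is none."""
    parser = NextLinkParser()
    parser.feed(html_text)
    return parser.next_href


# Hypothetical sample markup, not copied from the real site.
sample = ('<p><a class="prev" href="/chapters/15.html">prev</a> '
          '<a class="next" href="/chapters/17.html">next</a></p>')
print(find_next(sample))  # /chapters/17.html
```

Returning `None` when no link is found gives the download loop a clean stopping condition instead of an `IndexError` on `x[0]`.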

Leo Skhrnkv · Answer 1 · 2016-01-22T21:57:45.000

Here is an example that works:

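import urllib2
import re

name = '/chapters/16.html'
url = 'http://learnvimscriptthehardway.stevelosh.com'
while len(name) > 0:
    url1 = url + name
    response = urllib2.urlopen(url1)
    # Raw bytes as sent by the server; nothing is decoded here.
    data = response.read()
    # name[-7:] is the file name, e.g. '16.html'.
    address = './vim/' + name[-7:]
    # Binary mode writes the bytes to disk unchanged.
    with open(address, 'wb') as fh:
        fh.write(data)
    x = re.findall('''<a class="next" href="(.+?)"''', data)
    # Stop when the page has no "next" link.
    if x:
        name = x[0]
    else:
        break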

I am using Python 2.7.10, though. This code downloads all chapters in HTML format from the URL you specified. Notes: replace './vim/' (a vim folder under the current directory) with your own directory. I used name[-7:], the last 7 characters of the path, such as '16.html'. The conditional (if x: ...) prevents an 'index out of range' error when no "next" link is found.

This works for you in Python 2.7 because you're taking the encoded HTML and writing it straight to disk without decoding. Your `str()` around `response.read()` is unnecessary. — Alastair McCormack, Jan 22 '16 at 21:46

score 0 · Accepted Answer · answered Jan 23 '16 at 16:37

I finally solved this problem. In fact, everything in my code was fine except that it did not account for gzip. I should thank the person who reminded me that:

byte 0x8b in position 1 usually signals that the data stream is gzipped.

After I used the gzip module in my code, everything worked.
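
The answer does not show the changed code, so the following is only a sketch of what gzip handling could look like in Python 3. It assumes the body is either plain UTF-8 or a gzip stream of UTF-8; the names `decode_body` and `GZIP_MAGIC` are made up for illustration. A gzip stream always starts with the bytes 0x1f 0x8b, which matches the error message: 0x1f is valid UTF-8 on its own, so decoding fails on 0x8b at position 1.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Return the text of a response body, gunzipping it first if needed."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Offline check with made-up data instead of a real download.
page = '<a class="next" href="/chapters/18.html">'.encode("utf-8")
assert decode_body(page) == decode_body(gzip.compress(page))
```

In the loop from the question, this would replace the decode step, for example `vim = decode_body(response.read())`. Checking the `Content-Encoding` response header is another way to detect gzip; the magic-byte check is used here because it needs only the bytes themselves.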
