I've just started learning Python. While writing a tool to download the online book "Learn Vimscript the Hard Way", I ran into a problem.

This is my code (Python 3.5):

#coding: utf-8
import urllib.request
import re

url = 'http://learnvimscriptthehardway.stevelosh.com'
name = '/chapters/16.html'
while(len(name) != 0):
    url1 = url + name 
    print(url1)
    response = urllib.request.urlopen(url1)
    vim = response.read().decode('utf-8')
    address = "/Users/zhangzhimin/learnvimthehardway/" + name[-2:] + ".html"
    with open(address, "w") as f:
        f.write(vim)
    print("%s finish" % name)
    x = re.findall('''<a class="next" href="(.+?)"''', vim)
    name = x[0]

This is the result:

:!python3 test.py
http://learnvimscriptthehardway.stevelosh.com/chapters/16.html
/chapters/16.html finish
http://learnvimscriptthehardway.stevelosh.com/chapters/17.html
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    vim = response.read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I don't know why this happens: I can download and decode chapter 16, but the same code fails on chapter 17.

Thomas Baruchel
zhangzhimin
  • Is the webpage you are downloading actually encoded in utf-8? – Klaus D. Jan 20 '16 at 17:34
  • byte 0x8b in position 1 usually signals that the data stream is gzipped. Take a look [here](http://stackoverflow.com/a/13483961) – algor Jan 20 '16 at 17:51
  • Why do you decode? Open the file for writing in binary mode and just write the bytes you got. Oh, I see, you parse the file later. – Ale Jan 20 '16 at 19:35
  • Possible duplicate of [urllib2 opener providing wrong charset](http://stackoverflow.com/questions/9445627/urllib2-opener-providing-wrong-charset) – ivan_pozdeev Jan 21 '16 at 12:39
  • Consider using `requests`: the library transparently decompresses gzip-encoded responses (the `Content-Encoding` header). And yes, use [an HTML parser to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – ivan_pozdeev Jan 21 '16 at 12:46

2 Answers

Here is an example that works:

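import urllib2
import re

name = '/chapters/16.html'
url = 'http://learnvimscriptthehardway.stevelosh.com'
while len(name) > 0:
    url1 = url + name
    response = urllib2.urlopen(url1)
    # In Python 2, read() returns a byte string; it is saved without decoding.
    data = response.read()
    # name[-7:] is the file name, e.g. '16.html'.
    address = './vim/' + name[-7:]
    # Binary mode keeps the bytes exactly as received.
    with open(address, 'wb') as fh:
        fh.write(data)
    # Follow the "next" link; stop when the page has none.
    x = re.findall('''<a class="next" href="(.+?)"''', data)
    if x:
        name = x[0]
    else:
        break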

I am using Python 2.7.10, though. This code downloads all chapters in HTML format from the URL you specified. Notes: replace './vim/' (a vim subdirectory of the current directory) with your own directory. I used name[-7:], which is the last 7 characters of the path, such as '16.html'. The conditional (if x: ...) prevents an 'index out of range' error when a page has no next link.

Leo Skhrnkv
  • This works for you in Python 2.7 because you're taking the encoded HTML and writing it straight to disk without decoding. Your `str()` around `response.read()` is unnecessary. – Alastair McCormack Jan 22 '16 at 21:46
I finally solved this problem. Everything in my code was fine except that I had not accounted for gzip compression. I should thank the person who reminded me of this:

byte 0x8b in position 1 usually signals that the data stream is gzipped.

After I used the gzip module in my code, everything worked.
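
For anyone hitting the same error, here is a minimal sketch of what that fix might look like in Python 3. It is not the exact code I ended up with, just one way to do it. The helper names `decode_body`, `next_link`, and `download_from` are made up for illustration. A gzip stream always starts with the two bytes 0x1f 0x8b; 0x1f is a valid UTF-8 byte on its own, which is why the decoder only complains about 0x8b at position 1. The sketch checks for those two bytes and decompresses before decoding. It also takes the file name from the last path component, and stops cleanly when there is no next link.

```python
import gzip
import os
import re
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, charset="utf-8"):
    """Return the body as text, decompressing it first if it is gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


def next_link(html):
    """Return the href of the 'next' link, or None if the page has none."""
    match = re.search(r'<a class="next" href="(.+?)"', html)
    return match.group(1) if match else None


def download_from(base_url, name, out_dir):
    """Save each chapter, following 'next' links until none is left."""
    while name:
        with urllib.request.urlopen(base_url + name) as response:
            html = decode_body(response.read())
        filename = name.rsplit("/", 1)[-1]  # e.g. '16.html'
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
            f.write(html)
        print("%s finished" % name)
        name = next_link(html)
```

Calling `download_from('http://learnvimscriptthehardway.stevelosh.com', '/chapters/16.html', out_dir)` with an existing directory as `out_dir` should then walk through the chapters. Checking the `Content-Encoding` response header instead of the leading bytes would be another reasonable way to detect gzip.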

zhangzhimin