How to download books automatically from Gutenberg

Question

I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.

import requests
import re
import os
import urllib

def get_response(url):
    response = requests.get(url).text 
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)',re.S) 
    return re.findall(reg,html)


def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg,response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg,response)


def download_book(book_url,path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path) #my local file path

    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url,path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0],book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()

I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?

Your `re.findall` in `get_content` returns nothing, so `content` is empty, the loop body never runs, and you get nothing. – May 09 '18 at 12:17

score 3 · Answer 1 · edited Feb 18 '21 at 19:56

I have run the code and get nothing,no tracebacks.

Well, there's no chance you get a traceback in the case of an exception in download_book() since you explicitly silence them:

        try:
            download_book(book_url[0],book_name[0])
        except:
            continue

So the very first thing you want to do is to at least print out errors:

        try:
            download_book(book_url[0],book_name[0])
        except Exception as e:
            print("while downloading book {} : got error {}".format(book_url[0], e))
            continue

or just don't catch exceptions at all (at least until you know what to expect and how to handle it).
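
One way to read that last remark is to catch only the errors you expect and let everything else surface with a full traceback. The sketch below is an illustration under assumptions, not the original poster's code: `download_all` and its `download` parameter are hypothetical names, and `OSError` is chosen because the network errors raised by `urllib` (`urllib.error.URLError`) and ordinary file-system errors both derive from it.

```python
def download_all(books, download):
    """Call download(url, name) for each (url, name) pair.

    Hypothetical helper: `download` stands in for download_book().
    Expected I/O failures are reported and collected; anything else
    (IndexError, TypeError, ...) propagates with a normal traceback.
    """
    failures = []
    for url, name in books:
        try:
            download(url, name)
        except OSError as e:
            print("while downloading book {}: got error {!r}".format(url, e))
            failures.append((url, e))
    return failures


def fake_download(url, name):
    """Stand-in used only for this demonstration; performs no network access."""
    if url.endswith("/missing"):
        raise OSError("simulated network failure")


failures = download_all(
    [("http://example.invalid/ok", "a"), ("http://example.invalid/missing", "b")],
    fake_download,
)
```

A programming mistake such as indexing an empty list is not an `OSError`, so it would still stop the program and show where it happened.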

I don't even know how to fix it

Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.

For something more python-specific, here are a couple ways to trace your program execution:

1/ add print() calls at the important places to inspect what you really get

2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)

3/ use the builtin step debugger
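
Point 2 can be tried without touching the network. The sketch below copies `get_book_url()` and `get_book_name()` from the question and feeds them a small hand-written HTML fragment. The fragment is an assumption made up for illustration, not the real Gutenberg markup.

```python
import re


def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)


def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)


# Hand-written stand-in for one list on the page (an assumption)
sample = (
    '<ul><li><a href="/ebooks/1">First Book</a></li>'
    '<li><a href="/ebooks/2">Second Book</a></li></ul>'
)

urls = get_book_url(sample)
names = get_book_name(sample)

print('urls :', urls)
print('names:', names)
```

On this fragment `get_book_url()` finds both links, while `get_book_name()` returns a single string that starts at the first `>` and runs to the last `</a>`, markup included. That kind of surprise is much easier to spot on a small input than in a full run of the script.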

Now there are a few obvious issues with your code:

1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact that you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).

2/ you use regexps to parse html. DON'T - regexps cannot reliably work on html, you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc).
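
For the first point, a minimal sketch of checking a response before using it could look like the following. The helper name `get_html` and the Content-Type check are assumptions added for illustration; `raise_for_status()`, `headers` and `text` are the attributes a `requests` response provides.

```python
def get_html(response):
    """Return response.text, or raise if the request did not yield an HTML page.

    Hypothetical helper: `response` is assumed to behave like the object
    returned by requests.get().
    """
    # raise_for_status() raises an exception for 4xx and 5xx status codes
    response.raise_for_status()
    content_type = response.headers.get("Content-Type", "")
    if "html" not in content_type:
        raise ValueError("expected HTML, got Content-Type {!r}".format(content_type))
    return response.text


# Intended use (needs the third-party requests package and network access):
#     import requests
#     html = get_html(requests.get(start_url, timeout=10))
```

With a check like this, a failed or unexpected response stops the script with a clear error instead of handing an error page to the regexps.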

Regex can search HTML fine, it just can't parse it properly. The way he's searching will not cause any problems for him. Of course the regex might be wrong, but doing simple link searching like this shouldn't be a really big problem – Exelian, May 09 '18 at 12:28
@Exelian "you are technically correct, bureaucrat Conrad" - you _can_ of course make this work with regexps (I mean in this specific case), but it's tricky at best, while a proper html parser will make for code that's easier to write, debug and maintain - and I do have quite some experience with both complex regexps and html parsing / scraping. – bruno desthuilliers, May 09 '18 at 12:58

Daniel Lee · Answer 2 · 2018-05-09T12:34:43.197

1

start_url is not defined in main()

You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error, start_url is not defined

def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)

EDIT:

Never mind, the problem is in this line: content = get_content(get_response(start_url))

The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup (from bs4 import BeautifulSoup). For information on why you shouldn't parse html with regex, see the answer to "RegEx match open tags except XHTML self-contained tags":

Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
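
As a rough illustration of the parser-based approach, the sketch below collects link targets and link texts with `html.parser` from the standard library. That module is swapped in here only so the example runs without installing anything; it is not BeautifulSoup. The `LinkCollector` class, the `extract_links()` function and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the given HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hand-written stand-in for part of the page (an assumption)
sample = """
<h2><span class="mw-headline">Authors</span></h2>
<ul><li><a href="/ebooks/1">First <i>Book</i></a></li>
<li><a href="/ebooks/2">Second Book</a></li></ul>
"""
links = extract_links(sample)
print(links)
```

Because the parser works on tags rather than on raw characters, the newline after `</h2>` and the nested `<i>` element make no difference to the result.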

edited May 09 '18 at 12:34

answered May 09 '18 at 12:13

No, that's not the issue. The issue is with `get_content`: the regex returns nothing – May 09 '18 at 12:15
start_url IS defined in global scope, and will be used unless shadowed by a more narrow variable of the same name – folkol May 09 '18 at 12:20
If the problem was an undefined variable the OP would get a `NameError`. This is not the case here since variables not found in the local scope are looked up in the enclosing ones and `start_url` IS indeed defined in the global namespace when `get_url_name()` is called. This being said, passing it as an argument would be better indeed. – bruno desthuilliers May 09 '18 at 12:20
I will learn how to use BeautifulSoup later, thanks for your advice. – Manofsteel May 09 '18 at 13:47

score 0 · Answer 3 · answered May 09 '18 at 12:35

As others have said, you get no output because your regex doesn't match anything. The text returned by the initial url has a newline between </h2> and <ul>. Try this instead:

r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
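
The effect of that newline can be checked on a small string before running the whole script. In the sketch below the first two patterns are the ones from the question and from this answer; the third, with a lazy `.*?`, is a variant added here for comparison, and the sample markup is a hand-written assumption rather than the real page.

```python
import re

# Pattern from the question, and the variant with an explicit newline
original = re.compile(
    r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
with_newline = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)', re.S)
# Added for comparison only: same as with_newline but with a lazy .*?
lazy = re.compile(
    r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*?</a></li></ul>)', re.S)

# Hand-written stand-in for two sections of the page (an assumption)
sample = (
    '<h2><span class="mw-headline" id="A">A</span></h2>\n'
    '<ul><li><a href="/ebooks/1">First Book</a></li></ul>\n'
    '<h2><span class="mw-headline" id="B">B</span></h2>\n'
    '<ul><li><a href="/ebooks/2">Second Book</a></li></ul>\n'
)

no_match = original.findall(sample)
matches = with_newline.findall(sample)
per_section = lazy.findall(sample)
print(len(no_match), len(matches), len(per_section))
```

On this sample the original pattern finds nothing, and the pattern with `\n` finds one match that spans both sections, because the greedy `.*` combined with `re.S` runs to the last `</a></li></ul>`. The lazy variant yields one match per section.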

When you fix that one, you will face another error. I suggest adding some debug printouts like this:

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0],book_name[0])
            except Exception as e:
                # show the error instead of silently skipping the book
                print('[DEBUG] error:', e)
                continue

I tried your code and got the debug output, but how can I fix it? I wrote this code myself with some references, but I am new to Python and know little about how to use it correctly. I tried this code on another website, which is much easier than "http://www.gutenberg.org/", and it worked well for downloading the jpg pictures on that website. – Manofsteel, May 09 '18 at 13:54
The problem is not Python related, sort out your regexes and you should be fine. – folkol, May 09 '18 at 17:14
