
I am new to Python and I am looking for a way to extract, with Beautiful Soup, existing open-source books that are available on gutenberg-de, such as this one. I need to use them for further analysis and text mining.

I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.

import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_title, page_head)

I suppose I could use that list as a second step to extract the text, then? I am not sure how, though.
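For example, I am guessing that the "pages" are just the links inside the index page's body, so maybe something like this could work as that second step (I am only assuming that the chapters are linked as relative .html hrefs on the index page):

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = "https://www.projekt-gutenberg.org/keller/heinrich/"
page = requests.get(base_url)
soup = BeautifulSoup(page.content, 'html.parser')

# Guess: every chapter of the book is a relative link ending in .html
chapter_urls = [
    urljoin(base_url, a['href'])
    for a in soup.select('a[href]')
    if a['href'].endswith('.html') and not a['href'].startswith('http')
]
print(chapter_urls)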

Ideally I would like to store them in a tabular way and be able to save them as CSV, preserving the metadata (author, title, year, and chapter). Any ideas?

Grig
  • Needless to state the omnipresent truth, **there are no free lunches**! What have you tried so far, what worked and what did not work? – mnm Jan 26 '21 at 09:37
  • Thanks, sure. I did not think that in this case it would be of any help, but there it is now, I have updated the question :) – Grig Jan 26 '21 at 09:53

1 Answer


What happens?

First of all, you get a list of pages because you are not requesting the right URL. Change it to:

page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')

If you are looping over all the URLs, I recommend storing the content in a list of dicts and pushing it to CSV or pandas or ...

Example

import requests
from bs4 import BeautifulSoup

data = []

# Make a request to a single chapter page
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')

# Store the metadata and the chapter text as one dict per page
data.append({
    'title': soup.title.get_text(),
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})

print(data)
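If you want the whole book and the CSV, a rough sketch of that loop could look like the following. Assumptions here: the chapter pages are the relative .html links on the index page, pandas is available for the CSV export, and further metadata (author, year, ...) you would still have to add yourself:

from urllib.parse import urljoin

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.projekt-gutenberg.org/keller/heinrich/'

# Get the index page and collect the chapter links
# (assumption: chapters are relative .html links on the index page)
index = BeautifulSoup(requests.get(base_url).content, 'html.parser')
chapter_urls = [
    urljoin(base_url, a['href'])
    for a in index.select('a[href]')
    if a['href'].endswith('.html') and not a['href'].startswith('http')
]

data = []
for url in chapter_urls:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    data.append({
        'title': soup.title.get_text() if soup.title else '',
        'chapter': soup.h2.get_text() if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:])
    })

# Push everything into a DataFrame and save it as CSV ('heinrich.csv' is just an example name)
pd.DataFrame(data).to_csv('heinrich.csv', index=False)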
HedgeHog
  • Hi, thanks for the suggestion! I tried your code but it only extracts one page, rather than the whole book, and I did not understand well what you mean by "looping all the urls". Are you suggesting that I manually make a list of all the urls for each page? Wouldn't that kind of defeat the purpose of scraping? – Grig Jan 26 '21 at 14:00
  • That is correct, it extracts one page; if you use it in a loop you get all the extracts. Nope, I am not suggesting to make it manually, but SO is no free coding service and your question needs improvement, because it is not clear what you want exactly. So my suggestion is just a part of the puzzle :) – HedgeHog Jan 26 '21 at 14:08
  • Sorry, I thought it was clear enough! So I need to extract a whole book (and then more books). I am trying with the code above (and yours) but at this time I am not managing, so I am looking for a suggestion on how to do that. Stack Overflow might not be a "free coding service" but I have always found people willing to help! (including you) :) So the question now is: how do I loop over those urls if not manually? – Grig Jan 26 '21 at 14:13
  • I would recommend accepting the answer and closing this question. Concerning looping over the URLs, open up a new, improved one and mention me. – HedgeHog Jan 26 '21 at 14:26