
I am trying to get the blog content from this blog post, and by content I just mean the first six paragraphs. This is what I've come up with so far:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).text, 'lxml')  # BeautifulSoup parses markup, so fetch the page first
body = soup.find('div', class_='post-body')

Printing body also includes other stuff under the main div tag, not just the paragraphs I want.
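
For reference, a rough sketch of one way I imagine limiting the output, assuming the post body wraps each paragraph in <p> tags (I have not verified that this blog does), using one of the post URLs from the answers below as an example:

import requests
from bs4 import BeautifulSoup

url = "http://www.fashionpulis.com/2017/08/being-proud-too-soon.html"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
body = soup.find('div', class_='post-body')

# Keep only the first six paragraphs; this assumes the body actually uses <p> tags
paragraphs = body.find_all('p')[:6]
print("\n\n".join(p.get_text(strip=True) for p in paragraphs))
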

– Bargain23

2 Answers


Try this:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())

Or, to avoid hard-coding the post id for each new post, use this:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
– SIM
  • Thanks! However, I was wondering if there's a way for the scraper to be reused for similar blog posts. Example, [this one from the same site](http://www.fashionpulis.com/2017/08/acceptance-is-must.html) has a different id but similar format. – Bargain23 Aug 22 '17 at 18:27
  • So, the post-body id has to be hard-coded each time? – Bargain23 Aug 22 '17 at 18:33
  • Another way is `import re; soup.findAll('div', class_=re.compile('post-body'))`. BeautifulSoup handles that natively (a sketch of this follows these comments). – Wondercricket Aug 22 '17 at 18:43
  • @Bargain23, have you tried both URLs with the second script? It can grab the content from both posts as well. Thanks. – SIM Aug 22 '17 at 19:01
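
For completeness, a rough sketch of the regex approach from Wondercricket's comment; it assumes the post-body div on this blog also carries a class containing "post-body", which is what the question's soup.find('div', class_='post-body') suggests:

import re
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')

# Match any div whose class contains "post-body" instead of hard-coding the numeric id
for item in soup.find_all('div', class_=re.compile('post-body')):
    print(item.text.strip())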

I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python

However, I haven't found any query string parameters to work with here; still, maybe you can build something starting from that approach.

What seems most obvious to do right now is something like this (a rough sketch follows the list):

  1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
  2. Build the URLs from those titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
  3. Scrape the text as described by Shahin in the previous answer
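
A rough sketch of that idea; the archive URL pattern, the link filtering, and the helper names (post_links, post_text) are my assumptions based on the URL scheme above and are untested:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.fashionpulis.com"

def post_links(year, month):
    """Collect post URLs from one monthly archive page, e.g. /2017/03/."""
    prefix = "{}/{}/{:02d}/".format(BASE, year, month)
    soup = BeautifulSoup(requests.get(prefix).text, 'html.parser')
    # Keep links that follow the /$YEAR/$MONTH/$TITLE.html pattern
    return {a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith(prefix) and a['href'].endswith('.html')}

def post_text(url):
    """Grab the post body as in the previous answer."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    return "\n".join(item.text.strip()
                     for item in soup.select("div[id^='post-body-']"))

for link in sorted(post_links(2017, 8)):
    print(link)
    print(post_text(link)[:300])  # short preview of each post
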
– Rich Steinmetz