
I am trying to get the blog content from this blog post, and by content I just mean the first six paragraphs. This is what I've come up with so far:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get(url).text, 'lxml')  # BeautifulSoup parses markup, so fetch the page first
body = soup.find('div', class_='post-body')

Printing body also includes other stuff under the main div tag, not just the paragraphs I want.
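
For reference, a rough sketch of one way I imagine limiting the output, assuming the post body wraps each paragraph in <p> tags (I have not verified that this blog does), using one of the post URLs from the answers below as an example:

import requests
from bs4 import BeautifulSoup

url = "http://www.fashionpulis.com/2017/08/being-proud-too-soon.html"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
body = soup.find('div', class_='post-body')

# Keep only the first six paragraphs; this assumes the body actually uses <p> tags
paragraphs = body.find_all('p')[:6]
print("\n\n".join(p.get_text(strip=True) for p in paragraphs))
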

– Bargain23

2 Answers


Try this:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/being-proud-too-soon.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div#post-body-604825342214355274"):
    print(item.text.strip())

Or, to avoid hard-coding the post id for each new post, use this:

import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')
for item in soup.select("div[id^='post-body-']"):
    print(item.text)
– SIM
  • Thanks! However, I was wondering if there's a way for the scraper to be reused for similar blog posts. Example, [this one from the same site](http://www.fashionpulis.com/2017/08/acceptance-is-must.html) has a different id but similar format. – Bargain23 Aug 22 '17 at 18:27
  • So, the post-body id has to be hard-coded each time? – Bargain23 Aug 22 '17 at 18:33
  • Another way is `import re; soup.findAll('div', class_=re.compile('post-body'))`. BeautifulSoup handles that natively (a sketch of this follows these comments). – Wondercricket Aug 22 '17 at 18:43
  • @Bargain23, have you tried both URLs with the second script? It can grab the content from both posts as well. Thanks. – SIM Aug 22 '17 at 19:01
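
For completeness, a rough sketch of the regex approach from Wondercricket's comment; it assumes the post-body div on this blog also carries a class containing "post-body", which is what the question's soup.find('div', class_='post-body') suggests:

import re
import requests
from bs4 import BeautifulSoup

res = requests.get("http://www.fashionpulis.com/2017/08/acceptance-is-must.html").text
soup = BeautifulSoup(res, 'html.parser')

# Match any div whose class contains "post-body" instead of hard-coding the numeric id
for item in soup.find_all('div', class_=re.compile('post-body')):
    print(item.text.strip())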

I found this solution very interesting: Scrape multiple pages with BeautifulSoup and Python

However, I haven't found any query string parameters to work with here; still, maybe you can build something starting from that approach.

What seems most obvious to do right now is something like this (a rough sketch follows the list):

  1. Scrape through every month and year and get all titles from the Blog Archive part of the pages (e.g. on http://www.fashionpulis.com/2017/03/ and so on)
  2. Build the URLs from those titles and the corresponding months/years (the URL is always http://www.fashionpulis.com/$YEAR/$MONTH/$TITLE.html)
  3. Scrape the text as described by Shahin in the previous answer
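
A rough sketch of that idea; the archive URL pattern, the link filtering, and the helper names (post_links, post_text) are my assumptions based on the URL scheme above and are untested:

import requests
from bs4 import BeautifulSoup

BASE = "http://www.fashionpulis.com"

def post_links(year, month):
    """Collect post URLs from one monthly archive page, e.g. /2017/03/."""
    prefix = "{}/{}/{:02d}/".format(BASE, year, month)
    soup = BeautifulSoup(requests.get(prefix).text, 'html.parser')
    # Keep links that follow the /$YEAR/$MONTH/$TITLE.html pattern
    return {a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith(prefix) and a['href'].endswith('.html')}

def post_text(url):
    """Grab the post body as in the previous answer."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    return "\n".join(item.text.strip()
                     for item in soup.select("div[id^='post-body-']"))

for link in sorted(post_links(2017, 8)):
    print(link)
    print(post_text(link)[:300])  # short preview of each post
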
– Rich Steinmetz