I am very new to anything web-scraping related, and as I understand it, Requests and BeautifulSoup are the way to go for that. I want to write a program that emails me one paragraph of a given link every couple of hours (trying a new way to read blogs through the day). For example, this link 'https://fs.blog/mental-models/' has a paragraph on each of several mental models.

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Now soup contains a wall of markup before the paragraph text I want begins: <p> this is what I want to read </p>

soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?

thanks

spiff
  • if I could just have a list of paragraphs as output, then I can loop through it and read each paragraph (this paragraph could well be the next sub-heading too) – spiff Mar 18 '19 at 09:16
  • All the paragraphs from that page? – DirtyBit Mar 18 '19 at 09:16
  • Have you looked at the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)? It's very good and thorough - and it shouldn't take much searching in there to find what you need. – Robin Zigmond Mar 18 '19 at 09:17
  • @DirtyBit Indeed, just for a given link (it will always be blogs) – spiff Mar 18 '19 at 09:18
  • @spiff, see if the answer posted below helps? – DirtyBit Mar 18 '19 at 09:23

3 Answers


Loop over soup.find_all('p') to find all the p tags, then use .text to get their text.

Do all of that inside the div with the class rte, since you don't want the footer paragraphs.

from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'    
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

div_tags = soup.find_all("div", {"class": "rte"})
for div in div_tags:
    p_tags = div.find_all('p')
    for p in p_tags[:-2]:  # trim the last two irrelevant-looking paragraphs
        print(p.text)

OUTPUT:

Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).

 

DirtyBit
  • Nice answer, clean and elegant – Iakovos Belonias Mar 18 '19 at 09:22
  • this is beautiful - thanks vm! Just to close this out - if I only want to see one paragraph - so say I put each paragraph/ subheading in a list? so I can call only one of them at a time? – spiff Mar 18 '19 at 09:27
  • @spiff Indeed, depends on how you want it. you could put a check with `\n` new line to get separated paragraphs. – DirtyBit Mar 18 '19 at 09:30
  • sweet - so I am just appending pTags[n].text to a list and that will be used to separate each line.. thanks again! – spiff Mar 18 '19 at 09:37
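To pick one paragraph at a time, as discussed in these comments, here is a sketch that builds the list first and then indexes into it. The inline HTML snippet stands in for the fetched page so the example runs without a network request; the real code would use the soup built from requests.get as above.

```python
from bs4 import BeautifulSoup

# Small inline HTML snippet standing in for the fetched page.
html = """
<div class="rte">
<p>Mental models are how we understand the world.</p>
<p>1. The Map is Not the Territory</p>
<p>Maps are reductions of what they represent.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every paragraph under the div with class "rte" into a flat list.
paragraphs = [p.get_text()
              for div in soup.find_all("div", {"class": "rte"})
              for p in div.find_all("p")]

# Pick a single paragraph by index; an external counter (e.g. stored in a
# file between runs) could advance this index on every scheduled run.
index = 1
print(paragraphs[index])
```

The same selection could also be written with a CSS selector, soup.select("div.rte p"), which returns the identical flat list in one call.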

If you want the text of all the p tags, you can just loop over them using the find_all method:

from bs4 import BeautifulSoup
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)

EDIT:

Here is the code to collect them separately in a list. You can then loop over the result list to remove empty strings, unused characters like \n, etc.

from bs4 import BeautifulSoup
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())

print(result)
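One possible cleanup pass over such a result list, as the edit suggests: strip surrounding whitespace and drop entries that end up empty. The sample list below is hypothetical, standing in for a real scrape result.

```python
# Hypothetical scrape result: real paragraphs mixed with whitespace noise.
result = [
    "Mental models are how we understand the world.",
    "\n",
    "  ",
    "5. Mutually Assured Destruction\n",
]

# Strip each entry, keeping only those with text left after stripping.
cleaned = [text.strip() for text in result if text.strip()]
print(cleaned)
```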
Maaz
  • hey @Maaz thanks vm! this works too, just need to get rid of the tags right? – spiff Mar 18 '19 at 09:29
  • Here you have only the text, not the tag; you can put the text in a list if you want to get them separately. See my edit – Maaz Mar 18 '19 at 09:35

Here is a solution using Kivy's Clock (note that Clock is not a standalone module; it comes from the kivy package, and schedule_interval expects a callable, which it calls with the elapsed time while Kivy's event loop is running):

from bs4 import BeautifulSoup
import requests
from kivy.clock import Clock

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')

result = []

for p in data:
    result.append(p.get_text())

# schedule_interval needs a callable, not a call; the lambda receives
# the elapsed time (dt) on each tick. This only fires inside a running
# Kivy application.
Clock.schedule_interval(lambda dt: print(result), 60)
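Since Kivy's Clock only fires inside a running Kivy application, a plain script can get the same periodic behaviour with a standard-library sleep loop. The fetch_paragraphs stub below is a placeholder for the scraping code above; the interval and run count here are demonstration values only.

```python
import time

def fetch_paragraphs():
    # Placeholder for the scraping code above; a real version would
    # re-fetch the page and return its paragraphs as a list of strings.
    return ["first paragraph", "second paragraph"]

def run_every(interval_seconds, runs):
    # Simple stdlib alternative to Kivy's Clock: do the work, then sleep.
    collected = []
    for _ in range(runs):
        collected.append(fetch_paragraphs())
        time.sleep(interval_seconds)
    return collected

# Demonstration values; every couple of hours would be e.g.
# run_every(2 * 60 * 60, runs) in the real script.
results = run_every(0, 2)
print(results)
```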
nkr
elvis
  • ah - thank you so much @elvis ... didn't know about Clock.. now trying to connect Python to my email. My actual plan was to instead make a Telegram bot, which would send me these paragraphs.. but will start with email for now.. thanks again! – spiff Mar 19 '19 at 01:11
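For the emailing step mentioned in this comment, a minimal sketch using the standard library's email and smtplib modules. The SMTP host, addresses, and password below are placeholders, not real values; building the message is shown directly, while the actual send is left commented out since it needs working credentials.

```python
import smtplib  # only needed for the commented-out send step below
from email.message import EmailMessage

def build_email(paragraph, sender, recipient):
    # Wrap one scraped paragraph in a plain-text email message.
    msg = EmailMessage()
    msg["Subject"] = "Your paragraph of the day"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(paragraph)
    return msg

# Placeholder addresses for illustration.
msg = build_email("Mental models are how we understand the world.",
                  "me@example.com", "me@example.com")

# Sending requires real SMTP credentials; host and login are placeholders:
# with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
#     server.login("me@example.com", "app-password")
#     server.send_message(msg)
```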