How to scrape paragraphs from websites using Python?

Question

I am trying to make a question bank from this website

https://www.neetprep.com/questions/851-Botany/7918-Living-World?courseId=386

I am using the following code

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re

my_url = 'https://www.neetprep.com/questions/851-Botany/7918-Living-World?courseId=386'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("span",{"class": "-PmH"})
print(soup.prettify(containers[0]))

My output is coming as:

<span class="-PmH" id="questionUXVlc3Rpb246NzE3MQ==">
 <p>
  The third name in trinomial nomenclature is
 </p>
 <p>
  (1) Species
 </p>
 <p>
  (2) Subgenus
 </p>
 <p>
  (3) Subspecies
 </p>
 <p>
  (4) Ecotype
 </p>
</span>

How do I modify the code to get just the question and the options as plain text?

For this question my output should be

The third name in trinomial nomenclature is
(1) Species
(2) Subgenus
(3) Subspecies
(4) Ecotype

Hence I want to remove the <p> and </p> tags from my output.

You can use the get_text() method available in BeautifulSoup. It removes all HTML tags and gives you just the text. – whiplash, Oct 22 '20 at 16:52
`for p in containers[0].findAll('p'): print(p.text)`. You do not need to `prettify` — , Oct 22 '20 at 16:58
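To make the two comments above concrete, here is a minimal sketch of get_text() run against the markup from the question. It parses a hard-coded string instead of fetching the page, so it does not depend on the site still using the "-PmH" class; the variable names `html`, `container` and `text` are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question's output, so the sketch runs offline.
html = """<span class="-PmH" id="questionUXVlc3Rpb246NzE3MQ==">
 <p>
  The third name in trinomial nomenclature is
 </p>
 <p>
  (1) Species
 </p>
 <p>
  (2) Subgenus
 </p>
 <p>
  (3) Subspecies
 </p>
 <p>
  (4) Ecotype
 </p>
</span>"""

page_soup = BeautifulSoup(html, "html.parser")
container = page_soup.find("span", {"class": "-PmH"})

# separator="\n" puts each text fragment on its own line;
# strip=True trims whitespace and drops whitespace-only fragments.
text = container.get_text("\n", strip=True)
print(text)
```

Passing a separator together with strip=True keeps one line per paragraph even when the source HTML contains extra indentation or blank lines between tags.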

score 1 · Accepted Answer · answered Oct 22 '20 at 16:50

Try changing:

print(soup.prettify(containers[0]))

to

print(containers[0].text.split("\n"))

dimay

You don't even need to split on newlines – MattDMo Oct 22 '20 at 16:53
Just the containers[0].text gave my desired output. Thanks a lot! – Shahbaz Oct 22 '20 at 17:21
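Since the goal is a question bank rather than a single question, the same idea can be applied to every container on the page. The following is only a sketch: `parse_question` is a hypothetical helper, and it assumes that the first <p> in each container is the question and the remaining <p> tags are the options, which holds for the sample in the question but may not hold for every item on the site.

```python
from bs4 import BeautifulSoup

# Markup taken from the question; on the live page, page_html would come from urlopen().
html = """<span class="-PmH" id="questionUXVlc3Rpb246NzE3MQ==">
<p>The third name in trinomial nomenclature is</p>
<p>(1) Species</p>
<p>(2) Subgenus</p>
<p>(3) Subspecies</p>
<p>(4) Ecotype</p>
</span>"""


def parse_question(container):
    """Hypothetical helper: split one container into a question and its options.

    Assumes the first <p> is the question and the rest are options.
    Returns None when the container has no non-empty paragraphs.
    """
    parts = [p.get_text(" ", strip=True) for p in container.find_all("p")]
    parts = [part for part in parts if part]
    if not parts:
        return None
    return {"question": parts[0], "options": parts[1:]}


page_soup = BeautifulSoup(html, "html.parser")
containers = page_soup.find_all("span", {"class": "-PmH"})

# Build the bank, skipping containers that yielded nothing.
parsed = (parse_question(container) for container in containers)
question_bank = [entry for entry in parsed if entry is not None]

for entry in question_bank:
    print(entry["question"])
    for option in entry["options"]:
        print(option)
```

Keeping the question and options in separate fields makes it easier to write the bank out later (for example as JSON or CSV) than a single block of text would.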
