
I am trying to build a question bank from this website:

https://www.neetprep.com/questions/851-Botany/7918-Living-World?courseId=386

I am using the following code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re

my_url = 'https://www.neetprep.com/questions/851-Botany/7918-Living-World?courseId=386'

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

containers = page_soup.findAll("span",{"class": "-PmH"})
print(soup.prettify(containers[0]))

My output is:

<span class="-PmH" id="questionUXVlc3Rpb246NzE3MQ==">
 <p>
  The third name in trinomial nomenclature is
 </p>
 <p>
  (1) Species
 </p>
 <p>
  (2) Subgenus
 </p>
 <p>
  (3) Subspecies
 </p>
 <p>
  (4) Ecotype
 </p>
</span>

How do I modify the code so that the output is just the text of the question and its options?

For this question, my output should be:

The third name in trinomial nomenclature is
(1) Species
(2) Subgenus
(3) Subspecies
(4) Ecotype

In other words, I want to remove the <p> and </p> tags from my output.

MattDMo
Shahbaz
  • You can use the get_text() method available in BeautifulSoup. That will remove all HTML tags and give you just the text. – whiplash Oct 22 '20 at 16:52
  • `for p in containers[0].findAll('p'): print(p.text)`. You do not need to `prettify` –  Oct 22 '20 at 16:58

1 Answer


Try changing:

print(soup.prettify(containers[0]))

to

print(containers[0].text.split("\n"))
dimay
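
Note that `print(containers[0].text.split("\n"))` prints a Python list rather than one item per line. A possible alternative, along the lines of the comments under the question, is to read the text of each `<p>` tag and join the pieces with newlines. The sketch below is only an illustration: it parses the sample markup from the question instead of fetching the live page, and the names `SAMPLE_HTML` and `question_text` are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question's output. On the live page the
# containers would come from page_soup instead of this string (assumption:
# the page still uses the "-PmH" class).
SAMPLE_HTML = """
<span class="-PmH" id="questionUXVlc3Rpb246NzE3MQ==">
 <p>The third name in trinomial nomenclature is</p>
 <p>(1) Species</p>
 <p>(2) Subgenus</p>
 <p>(3) Subspecies</p>
 <p>(4) Ecotype</p>
</span>
"""


def question_text(container):
    """Return the text of every <p> inside container, one per line."""
    # get_text(strip=True) drops the tags and the surrounding whitespace.
    lines = [p.get_text(strip=True) for p in container.find_all("p")]
    # Skip empty paragraphs so the output has no blank lines.
    return "\n".join(line for line in lines if line)


page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = page_soup.find_all("span", {"class": "-PmH"})

# Print every question on the page, not only the first one.
for container in containers:
    print(question_text(container))
```

With the real page, the same loop can be run over the `containers` list built in the question's code; `prettify` is not needed, because it only re-indents the HTML and does not remove tags.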