I am learning BS4 and took it upon myself to scrape a generic multipage website, then put the material into a JSON file. I referred to How to print paragraph..., Python Beautiful.., and Extract the text between..., but none of them got me the answer I wanted.
Here is my tag structure:
<div class ="article-body">
<section class="xyz" >
<h2>title</h2>
<h3>subtitle</h3>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
<h3>subtitle</h3>
<p>para 1</p>
<h2>subtitle</h3>
<p>para 1</p>
<p>para 2</p>
</section>
<section> ANOTHER SECTION </section>
</div>
This is my code. I was only able to get the first heading, first subtitle, and first paragraph:
import urllib.request

from bs4 import BeautifulSoup

url = "https://www.project.com/page-1.aspx"
url2 = "https://www.project.com/page-13.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# find_all accepts a single attrs dict; passing a second dict makes it
# the `recursive` argument, which is not what I meant
span = soup.find_all('section', attrs={'id': 'section-body', 'class': 'text-body'})
for items in span:
    for item in items.find_all('h2'):
        # for title (item is the <h2> itself)
        title = item.text
        # for subtitle -- find_next only returns the first following <h3>
        subtitle = item.find_next("h3").text
        # for para -- likewise only the first following <p>
        para = item.find_next("p").text
        print(para)
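As far as I can tell, the core problem is that find_next only ever returns the first following match, so every h2 gives me exactly one subtitle and one paragraph. What I probably need instead is to walk the h3/p tags in document order and start a new group at each h3. A rough, untested sketch of that idea, run against the sample structure above (sample_html is a placeholder for that snippet as a string, not the live page):

from bs4 import BeautifulSoup

# sample_html is hypothetical: the tag structure above pasted as a string
soup = BeautifulSoup(sample_html, 'html.parser')
for section in soup.find_all('section'):
    groups = []
    for tag in section.find_all(['h3', 'p']):
        if tag.name == 'h3':
            # every <h3> opens a new subtitle/para group
            groups.append({'subtitle': tag.text, 'para': []})
        elif groups:
            # attach the <p> to the most recent subtitle; skip any <p>
            # that appears before the first <h3>
            groups[-1]['para'].append(tag.text)
    print(groups)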
This is another approach I tried; it got me everything joined into one blob:
for items in span:
    # all joined
    data = '\n'.join([item.text for item in items.find_all(["h3", "p"])])
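Everything gets joined because find_all flattens the whole section into a single list before the join. If I instead collect only the siblings between one h3 and the next heading, the join would stay per-subtitle. Another untested sketch, assuming the paragraphs are siblings of the headings as in my sample:

for h3 in items.find_all('h3'):
    paras = []
    for sibling in h3.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break  # the next heading starts a new group
        if sibling.name == 'p':
            paras.append(sibling.text)
    data = '\n'.join(paras)  # joined per subtitle instead of per section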
I even tried to merge the paragraphs:
subtitle = []
para = []
text = ''  # plain string accumulator; reusing `soup` here clobbered the parsed document
for items in span:
    for item in items.find_all(['p', 'h3']):
        if item.name == 'h3':
            title = item.text
            subtitle.append(title)
            print(title)
        if item.name == 'p':
            if item.find_next('h3'):
                text = text + item.text
                para.append(text)
                text = ''
                print(para)
                print('******\n\n')
            else:
                text = item.text
                para.append(text)
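Even with a separate accumulator, though, the p branch appends on every paragraph, so nothing really merges. I suspect the right pattern is to buffer paragraphs and only flush the buffer when the next h3 shows up, plus once more after the loop for the trailing group. A sketch:

buffer = []
for item in items.find_all(['p', 'h3']):
    if item.name == 'h3':
        if buffer:
            para.append(' '.join(buffer))  # flush the finished group
            buffer = []
        subtitle.append(item.text)
    else:
        buffer.append(item.text)
if buffer:
    para.append(' '.join(buffer))  # flush the last group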
This is the output I want, in JSON format:
[
{
"page": "page-1",
"section_one": [
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2 + para 3.. joined"
},
{
"subtitle": "subtitle here h3"
"para": "para 1"
},
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2.. joined"
}
],
"section_two": [
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2 + para 3.. joined"
}
]
},
{
"page": "page-2"
// Comment - Page 2 related stuff
}
]
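Putting the pieces together, this is roughly the full script I have in mind. It is an untested sketch: it assumes the div/section classes from my sample, assumes the sections appear in the order section_one then section_two, and derives the "page" value from the URL.

import json
import urllib.request

from bs4 import BeautifulSoup

SECTION_KEYS = ['section_one', 'section_two']  # key names taken from the target output

pages = []
for url in ['https://www.project.com/page-1.aspx',
            'https://www.project.com/page-13.aspx']:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # "page-1.aspx" -> "page-1"
    page = {'page': url.rsplit('/', 1)[-1].replace('.aspx', '')}
    body = soup.find('div', class_='article-body')
    for key, section in zip(SECTION_KEYS, body.find_all('section')):
        groups = []
        for tag in section.find_all(['h3', 'p']):
            if tag.name == 'h3':
                groups.append({'subtitle': tag.text, 'para': []})
            elif groups:
                groups[-1]['para'].append(tag.text)
        for group in groups:
            group['para'] = ' '.join(group['para'])  # "para 1 + para 2 ... joined"
        page[key] = groups
    pages.append(page)

print(json.dumps(pages, indent=2))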