I am learning BS4 and took it upon myself to scrape a generic multipage website, then put the material into a JSON file. I referred to How to print paragraph..., Python Beautiful.., and Extract the text between..., but none of them got me the answer I wanted.
Here is my tag structure:
<div class ="article-body">
<section class="xyz" >
<h2>title</h2>
<h3>subtitle</h3>
<p>para 1</p>
<p>para 2</p>
<p>para 3</p>
<h3>subtitle</h3>
<p>para 1</p>
<h2>subtitle</h3>
<p>para 1</p>
<p>para 2</p>
</section>
<section> ANOTHER SECTION </section>
</div>
This is my code. I was only able to get the first heading, first subtitle, and first paragraph:
import urllib.request

from bs4 import BeautifulSoup

url = "https://www.project.com/page-1.aspx"
url2 = "https://www.project.com/page-13.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# find_all accepts a single attrs dict; passing a second dict makes it
# the `recursive` argument, which is not what I meant
span = soup.find_all('section', attrs={'id': 'section-body', 'class': 'text-body'})
for items in span:
    for item in items.find_all('h2'):
        # for title (item is the <h2> itself)
        title = item.text
        # for subtitle -- find_next only returns the first following <h3>
        subtitle = item.find_next("h3").text
        # for para -- likewise only the first following <p>
        para = item.find_next("p").text
        print(para)
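As far as I can tell, the core problem is that find_next only ever returns the first following match, so every h2 gives me exactly one subtitle and one paragraph. What I probably need instead is to walk the h3/p tags in document order and start a new group at each h3. A rough, untested sketch of that idea, run against the sample structure above (sample_html is a placeholder for that snippet as a string, not the live page):

from bs4 import BeautifulSoup

# sample_html is hypothetical: the tag structure above pasted as a string
soup = BeautifulSoup(sample_html, 'html.parser')
for section in soup.find_all('section'):
    groups = []
    for tag in section.find_all(['h3', 'p']):
        if tag.name == 'h3':
            # every <h3> opens a new subtitle/para group
            groups.append({'subtitle': tag.text, 'para': []})
        elif groups:
            # attach the <p> to the most recent subtitle; skip any <p>
            # that appears before the first <h3>
            groups[-1]['para'].append(tag.text)
    print(groups)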
This is another approach I tried; it got me everything joined into one blob:
for items in span:
    # all joined
    data = '\n'.join([item.text for item in items.find_all(["h3", "p"])])
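Everything gets joined because find_all flattens the whole section into a single list before the join. If I instead collect only the siblings between one h3 and the next heading, the join would stay per-subtitle. Another untested sketch, assuming the paragraphs are siblings of the headings as in my sample:

for h3 in items.find_all('h3'):
    paras = []
    for sibling in h3.find_next_siblings():
        if sibling.name in ('h2', 'h3'):
            break  # the next heading starts a new group
        if sibling.name == 'p':
            paras.append(sibling.text)
    data = '\n'.join(paras)  # joined per subtitle instead of per section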
I even tried to merge the paragraphs:
subtitle = []
para = []
text = ''  # plain string accumulator; reusing `soup` here clobbered the parsed document
for items in span:
    for item in items.find_all(['p', 'h3']):
        if item.name == 'h3':
            title = item.text
            subtitle.append(title)
            print(title)
        if item.name == 'p':
            if item.find_next('h3'):
                text = text + item.text
                para.append(text)
                text = ''
                print(para)
                print('******\n\n')
            else:
                text = item.text
                para.append(text)
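Even with a separate accumulator, though, the p branch appends on every paragraph, so nothing really merges. I suspect the right pattern is to buffer paragraphs and only flush the buffer when the next h3 shows up, plus once more after the loop for the trailing group. A sketch:

buffer = []
for item in items.find_all(['p', 'h3']):
    if item.name == 'h3':
        if buffer:
            para.append(' '.join(buffer))  # flush the finished group
            buffer = []
        subtitle.append(item.text)
    else:
        buffer.append(item.text)
if buffer:
    para.append(' '.join(buffer))  # flush the last group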
This is the output I want, in JSON format:
[
{
"page": "page-1",
"section_one": [
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2 + para 3.. joined"
},
{
"subtitle": "subtitle here h3"
"para": "para 1"
},
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2.. joined"
}
],
"section_two": [
{
"subtitle": "subtitle here h3"
"para": "para 1 + para 2 + para 3.. joined"
}
]
},
{
"page": "page-2"
// Comment - Page 2 related stuff
}
]
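Putting the pieces together, this is roughly the full script I have in mind. It is an untested sketch: it assumes the div/section classes from my sample, assumes the sections appear in the order section_one then section_two, and derives the "page" value from the URL.

import json
import urllib.request

from bs4 import BeautifulSoup

SECTION_KEYS = ['section_one', 'section_two']  # key names taken from the target output

pages = []
for url in ['https://www.project.com/page-1.aspx',
            'https://www.project.com/page-13.aspx']:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # "page-1.aspx" -> "page-1"
    page = {'page': url.rsplit('/', 1)[-1].replace('.aspx', '')}
    body = soup.find('div', class_='article-body')
    for key, section in zip(SECTION_KEYS, body.find_all('section')):
        groups = []
        for tag in section.find_all(['h3', 'p']):
            if tag.name == 'h3':
                groups.append({'subtitle': tag.text, 'para': []})
            elif groups:
                groups[-1]['para'].append(tag.text)
        for group in groups:
            group['para'] = ' '.join(group['para'])  # "para 1 + para 2 ... joined"
        page[key] = groups
    pages.append(page)

print(json.dumps(pages, indent=2))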