Find next siblings until a certain one using beautifulsoup

Question

The webpage is something like this:

<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>

<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>

How can I find each section with the articles within it? That is, after finding an h2, find its next siblings until the next h2.

If the webpage were structured like this (which is normally the case):

<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

I can write code like:

for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...

But what should I do with the first webpage if I want to get the same result?

score 11 · Accepted Answer · answered Jul 25 '12 at 11:35

I think you can do something like this:

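for section in soup.findAll('h2'):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        if nextNode is None:
            # No more siblings: the last section ends here.
            print("*****")
            break
        # Text nodes (e.g. the whitespace between tags) have no tag name; skip them.
        tag_name = getattr(nextNode, 'name', None)
        if tag_name is None:
            continue
        if tag_name == "p":
            print(nextNode.string)
        else:
            # Any other tag (such as the next h2) closes the current section.
            print("*****")
            break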

Given:

<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>

Output:

article1
article2
article3
*****
article4
article5
article6
*****

Zubair Afzal


Thank you. This does indeed separate the sections, but it doesn't seem to make the articles belong to a particular section. I would like something that gets the same result as the first example I gave. – user1550725 Jul 27 '12 at 03:45

@user1550725 Please check the solution again. I think this solution separates articles according to their section. – Zubair Afzal Jul 27 '12 at 04:13
I'm actually using Calibre to make a recipe for a webpage I want to download. This involves identifying the section and articles within the section (after which they are converted into an e-book). The solution you gave seems to treat section name and articles as the same. – user1550725 Jul 27 '12 at 04:22
@user1550725 "soup.findAll('h2')" will give you all the sections, which are not the same as articles, while "section.nextSibling" will give you the next node, which you then check to see whether it is an article (a <p> tag) or a section. I assumed that the structure would be the same as the one you provided (meaning only <h2> and <p> tags). You get both sections and articles; it's up to you whether to treat them the same or separately. I hope this clears up your confusion. – Zubair Afzal Jul 27 '12 at 06:33
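
The comments above turn on whether each article ends up attached to its section. A minimal sketch of one way to keep that association, assuming BeautifulSoup 4 (the bs4 package) and the flat markup from the question; the helper name group_by_section and the HTML constant are made up for illustration and are not part of either answer:

```python
from bs4 import BeautifulSoup, Tag

# Sample markup copied from the answer above (flat, no wrapping <div>).
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def group_by_section(soup):
    """Return [(section_title, [article_text, ...]), ...] in document order.

    Hypothetical helper: walks the siblings after each <h2> and stops at
    the next <h2>, so every article is stored with its own section.
    """
    sections = []
    for h2 in soup.find_all("h2"):
        articles = []
        for sib in h2.next_siblings:
            if not isinstance(sib, Tag):
                continue  # skip whitespace/text nodes between tags
            if sib.name == "h2":
                break  # the next section starts here
            if sib.name == "p":
                articles.append(sib.get_text())
        sections.append((h2.get_text(), articles))
    return sections


soup = BeautifulSoup(HTML, "html.parser")
sections = group_by_section(soup)
for title, articles in sections:
    print(title, articles)
```

Each tuple pairs a section title with only the articles that follow it, which may be closer to what a Calibre recipe needs than printing a separator line.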

score 3 · Answer 2 · answered Jan 11 '20 at 16:48

The next_siblings iterator can be helpful here as well:

for i in soup.find_all('h2'):
    for sib in i.next_siblings:
        if sib.name == 'p':
            print(sib.text)
        elif sib.name == 'h2':
            print("*****")
            break
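
If the goal is to reuse the div-based loop from the question unchanged, another option is to wrap each h2 and its following siblings in a new div first. This is only a sketch under the assumption that BeautifulSoup 4 is in use (new_tag, insert_before and append are bs4 methods); the function name wrap_sections is hypothetical:

```python
from bs4 import BeautifulSoup, Tag

# Flat markup as in the question: sections are not wrapped in <div>.
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def wrap_sections(soup):
    """Wrap every <h2> and the siblings up to the next <h2> in a <div>.

    Hypothetical helper; it modifies the soup in place and returns it.
    """
    for h2 in soup.find_all("h2"):
        # Collect the nodes first, because moving them changes the sibling chain.
        members = [h2]
        for sib in h2.next_siblings:
            if isinstance(sib, Tag) and sib.name == "h2":
                break
            members.append(sib)
        wrapper = soup.new_tag("div")
        h2.insert_before(wrapper)
        for node in members:
            wrapper.append(node)  # append() moves the node into the wrapper
    return soup


soup = wrap_sections(BeautifulSoup(HTML, "html.parser"))

# The loop from the question now works on the rewritten tree.
grouped = []
for section in soup.find_all("div"):
    title = section.find("h2").get_text()
    posts = [p.get_text() for p in section.find_all("p")]
    grouped.append((title, posts))
    print(title, posts)
```

The trade-off is that the parsed tree is modified, so this fits best when the soup is used only for extraction.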
