The webpage is something like this:

<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>

<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>

How can I find each section together with the articles within it? That is, after finding an h2, collect its next siblings until the next h2.

If the webpage were like this (which is normally the case):

<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

I can write code like this:

for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...

But what should I do with the first webpage if I want to get the same result?

user1550725

2 Answers

I think you can do something like this:

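for section in soup.findAll('h2'):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        # getattr covers both None (end of document) and text nodes
        tag_name = getattr(nextNode, 'name', None)
        if tag_name == "p":
            print(nextNode.string)
        elif nextNode is not None and tag_name is None:
            # skip text nodes, e.g. the whitespace between tags
            continue
        else:
            print("*****")
            break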

Given:

<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>

Output:

article1
article2
article3
*****
article4
article5
article6
*****
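
If the articles need to stay attached to their section rather than just being printed, the same sibling walk can collect them into a mapping. This is only a minimal sketch: it assumes BeautifulSoup 4 (the bs4 package) with the built-in html.parser, and the names HTML and group_by_heading are made up for illustration. Sections that share the same heading text would overwrite each other in the dict.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the "Given" block above.
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def group_by_heading(soup):
    """Map each h2's text to the list of p texts that follow it."""
    sections = {}
    for heading in soup.find_all("h2"):
        articles = []
        node = heading.next_sibling
        # Walk forward until the document ends or the next h2 starts.
        while node is not None and getattr(node, "name", None) != "h2":
            # Text nodes (whitespace between tags) have no tag name and are skipped.
            if getattr(node, "name", None) == "p":
                articles.append(node.get_text())
            node = node.next_sibling
        sections[heading.get_text()] = articles
    return sections


sections = group_by_heading(BeautifulSoup(HTML, "html.parser"))
for title, articles in sections.items():
    print(title, articles)
```

Each key is then a section and each value holds that section's articles, which is the same shape the nested div loop in the question produces.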
Zubair Afzal
  • Thank you. This does separate the sections, but it doesn't seem to make the articles belong to a particular section. I would like something that gets the same result as the first example I gave. – user1550725 Jul 27 '12 at 03:45
  • @user1550725 Please check the solution again. I think this solution separates articles according to their section. – Zubair Afzal Jul 27 '12 at 04:13
  • I'm actually using Calibre to make a recipe for a webpage I want to download. This involves identifying the section and articles within the section (after which they are converted into an e-book). The solution you gave seems to treat section name and articles as the same. – user1550725 Jul 27 '12 at 04:22
  • @user1550725 "soup.findAll('h2')" will give you all the sections, which are not the same as the articles, while "section.nextSibling" will give you the next node, which you then check to see whether it is an article (having a <p> tag) or a section. I assumed that the structure would be the same as the one you provided (meaning only <h2> and <p>). You are getting both sections and articles; it's up to you to treat them the same or separately. I hope this clarifies your confusion. – Zubair Afzal Jul 27 '12 at 06:33
The next_siblings iterator can be helpful here as well:

for i in soup.find_all('h2'):
    for sib in i.next_siblings:
        if sib.name == 'p':
            print(sib.text)
        elif sib.name == 'h2':
            print("*****")
            break
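
To return the sections instead of printing them, next_siblings can be combined with itertools.takewhile so that iteration stops at the next h2. This is a rough sketch, assuming BeautifulSoup 4 with html.parser; HTML, articles_after and pairs are illustrative names, not part of any library. A list of pairs is used here so that repeated heading texts do not collide.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

# Same sample structure as in the question: h2 headings followed by p siblings.
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def articles_after(heading):
    """Return the text of every p sibling between this h2 and the next h2."""
    # takewhile stops consuming siblings as soon as the next h2 appears.
    in_section = takewhile(
        lambda sib: getattr(sib, "name", None) != "h2",
        heading.next_siblings,
    )
    # Whitespace text nodes have no tag name, so the filter drops them.
    return [sib.get_text() for sib in in_section
            if getattr(sib, "name", None) == "p"]


soup = BeautifulSoup(HTML, "html.parser")
pairs = [(h2.get_text(), articles_after(h2)) for h2 in soup.find_all("h2")]
for title, articles in pairs:
    print(title, articles)
```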
rtphokie