Using BeautifulSoup to parse multiple paragraphs in Python

Question

I would like to use BeautifulSoup in Python to turn HTML like this:

<p><b>Background</b><br />x0</p><p>x1</p>
<p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
<p><b>Activities</b><br />x5</p><p>x6</p>

into this result:

Background: x0, x1
Innovation: x2, x3, x4
Activities: x5, x6

I have tried the Python script below:

from bs4 import BeautifulSoup
htmltext = "<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>"
html = BeautifulSoup(htmltext)
for n in html.find_all('b'):
    title_name = n.next_element
    title_content = n.nextSibling.nextSibling
    print title_name, title_content

However, I can only get this:

Background: x0
Innovation: x2
Activities: x5

Your comments are welcome and your suggestions will be appreciated.

Shouldn't the sample html assigned to `htmls` be a docstring? — mr2ert, Aug 23 '13 at 17:59
What exactly are the conditions for being included in the result? Do you want the inner text of every `<p>` element in between successive `<b>` elements? — Patrick Collins, Aug 23 '13 at 21:20

score 2 · Answer 1 · answered Aug 23 '13 at 18:01

In `<p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>` you are going to the `<b>` element and locating x2 through `next_element`. That's all good. But to locate x3 and x4 you first need to go up in the element hierarchy to the enclosing `<p>` element and from there locate the following `<p>`s enclosing x3 and x4.
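
The navigation described above could look roughly like the sketch below. It assumes the sample HTML from the question and the built-in `html.parser`; the names `collect_section` and `sections` are made up for illustration, and treating "a `<p>` that contains a `<b>`" as the start of a new section is an assumption that fits the sample only.

```python
from bs4 import BeautifulSoup

html_text = """<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>"""

soup = BeautifulSoup(html_text, "html.parser")


def collect_section(b_tag):
    """Return the text pieces that belong to the heading held in b_tag."""
    # Text nodes that follow <b> inside the same <p> (the <br /> tag is skipped).
    pieces = [s.strip() for s in b_tag.next_siblings
              if isinstance(s, str) and s.strip()]
    # Go up to the enclosing <p>, then walk the <p> tags that follow it.
    for p in b_tag.parent.find_next_siblings("p"):
        if p.find("b") is not None:
            break  # assumption: a <p> holding a <b> opens the next section
        pieces.append(p.get_text(strip=True))
    return pieces


sections = {b.get_text(strip=True): collect_section(b)
            for b in soup.find_all("b")}

for title, pieces in sections.items():
    print(f"{title}: {', '.join(pieces)}")
```

On the sample input this prints one line per heading in the `Title: a, b` form the question asks for.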

Mario Rossi

Ideally, it should be so. In practice, however, when people write HTML for subtitles and their paragraphs, there is no specified element hierarchy for the enclosing `<p>`. — Frank Wang, Aug 23 '13 at 18:08

mr2ert · Accepted Answer · 2013-08-23T18:51:00.417

I'm pretty new to BeautifulSoup, but this is working for me:

import bs4
from bs4 import BeautifulSoup

htmls = """<p><b>Background</b><br />x0</p><p>x1</p>
           <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
           <p><b>Activities</b><br />x5</p><p>x6</p>"""
# Name the parser explicitly so the result does not depend on what is installed.
html = BeautifulSoup(htmls, 'html.parser')

for n in html.find_all('b'):
    title_name = n.next_element                   # the text inside <b>
    title_content = n.next_sibling.next_sibling   # skip <br />, take the text after it

    results = [title_content]
    # Walk the tags that follow the <p> enclosing this <b>.
    for f in n.parent.find_next_siblings():
        el = f.next_element
        # A <p> that starts with <b> opens the next section, so stop there.
        if isinstance(el, bs4.element.Tag) and el.name == 'b':
            break
        results.append(el)

    print(title_name, results)

Results:

Background ['x0', 'x1']
Innovation ['x2', 'x3', 'x4']
Activities ['x5', 'x6']

I chose to use isinstance(el, bs4.element.Tag) and el.name == 'b' as the delimiter because in your example the `<p>` tags you are trying to capture have no children. This part should probably be a little different depending on the real webpage you are parsing.
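
One way to keep that delimiter easy to change is to pass it in as a function and group the `<p>` tags in a single pass. The sketch below is only an illustration under the assumptions of the sample HTML: `heading_of` and `group_paragraphs` are hypothetical names, and a real page may need a different `heading_of` (for example one that looks for `<strong>` or a CSS class).

```python
from bs4 import BeautifulSoup

html_text = """<p><b>Background</b><br />x0</p><p>x1</p>
           <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
           <p><b>Activities</b><br />x5</p><p>x6</p>"""


def heading_of(p):
    """Hypothetical delimiter: return the <b> inside p, or None if p has none."""
    return p.find("b")


def group_paragraphs(soup, heading_of=heading_of):
    """Return [(title, [text, ...]), ...] in document order."""
    sections = []
    for p in soup.find_all("p"):
        heading = heading_of(p)
        if heading is not None:
            # This <p> opens a section; keep its text except the heading itself.
            body = [s.strip() for s in p.find_all(string=True)
                    if s.parent is not heading and s.strip()]
            sections.append((heading.get_text(strip=True), body))
        elif sections:
            # A plain <p> belongs to the most recently opened section.
            sections[-1][1].append(p.get_text(strip=True))
    return sections


grouped = group_paragraphs(BeautifulSoup(html_text, "html.parser"))
for title, body in grouped:
    print(f"{title}: {', '.join(body)}")
```

Paragraphs that appear before the first heading are dropped here; whether that is the right behaviour depends on the page being parsed.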

As a side note -- it's best to avoid `isinstance` whenever possible when working in Python, because one of the benefits of object oriented programming is exactly that these kinds of checks are unnecessary. See, for a full explanation, [this question](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python). — Patrick Collins, Aug 26 '13 at 12:51

Patrick Collins · Answer 3 · 2013-08-23T18:55:46.817

You're stopping after reading one more tag; you need to keep going until you hit the next `<b>`. `nextSibling` isn't going to work because the `<p>`s you're parsing aren't siblings of the `<b>`s. Try something like this:

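from bs4 import BeautifulSoup


def in_same_section(n):
    # True while the element after n exists and is not the next <b> heading.
    nxt = n.next_element
    return nxt is not None and getattr(nxt, 'name', None) != 'b'


htmltext = '''<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>'''
html = BeautifulSoup(htmltext, 'html.parser')
for b in html.find_all('b'):
    title_name = b.string
    title_content = []
    n = b.next_element  # the heading text inside <b>; start the walk after it
    while in_same_section(n):
        n = n.next_element
        # Text nodes have no tag name; keep the non-blank ones.
        if getattr(n, 'name', None) is None and n.strip():
            title_content.append(n.strip())
    print(title_name, title_content)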

EDIT: Fixed the AttributeError, I think? I'm at work and can't test this code.
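
For comparison, the same "keep going until the next `<b>`" walk can be sketched with the `next_elements` generator that BeautifulSoup provides on every element. This is an untested-in-the-wild illustration on the sample HTML only; `section_text` is a made-up helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

htmltext = '''<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>'''
soup = BeautifulSoup(htmltext, 'html.parser')


def section_text(b_tag):
    """Collect the non-blank text after b_tag, stopping at the next <b>."""
    texts = []
    for el in b_tag.next_elements:
        if isinstance(el, Tag) and el.name == 'b':
            break  # the next heading starts here
        # Skip the heading's own text and whitespace-only nodes.
        if isinstance(el, NavigableString) and el.parent is not b_tag and el.strip():
            texts.append(el.strip())
    return texts


collected = [(b.get_text(strip=True), section_text(b)) for b in soup.find_all('b')]
for title, texts in collected:
    print(f"{title}: {', '.join(texts)}")
```

The generator simply ends at the end of the document, so the last section needs no special case.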

Fails at `while n.next_element.name != u'b':` with `AttributeError: 'NavigableString' object has no attribute 'name'` for me. — mr2ert, Aug 23 '13 at 18:40
