1

I would like to use BeautifulSoup in Python to parse HTML like this:

<p><b>Background</b><br />x0</p><p>x1</p>
<p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
<p><b>Activities</b><br />x5</p><p>x6</p>

to this result:

Background: x0, x1
Innovation: x2, x3, x4
Activities: x5, x6

I have tried the Python script below:

from bs4 import BeautifulSoup
htmltext = """<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>"""
html = BeautifulSoup(htmltext)
for n in html.find_all('b'):
    title_name = n.next_element
    title_content = n.nextSibling.nextSibling
    print title_name, title_content

However, I can only get this:

Background: x0
Innovation: x2
Activities: x5

Your comments are welcome and your suggestions will be appreciated.

Frank Wang

3 Answers

2

In <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p> you are going to the <b> element and locating x2 through its siblings (nextSibling.nextSibling). That's all good. But to locate x3 and x4 you first need to go up in the element hierarchy to the enclosing <p> element, and from there locate the following <p>s that enclose x3 and x4.
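A rough sketch of that idea, not a tested solution for your real pages: it assumes Python 3, the built-in html.parser, and that every section starts with a <p> that contains a <b>. The names `sections` and `heading_p` are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html_text = """<p><b>Background</b><br />x0</p><p>x1</p>
<p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
<p><b>Activities</b><br />x5</p><p>x6</p>"""

soup = BeautifulSoup(html_text, "html.parser")

sections = {}
for b in soup.find_all("b"):
    heading_p = b.parent  # go up to the enclosing <p>

    # Text that sits directly inside the heading <p> (e.g. "x0"),
    # which excludes the title because that text lives inside <b>.
    items = [
        child.strip()
        for child in heading_p.children
        if isinstance(child, NavigableString) and child.strip()
    ]

    # Walk the following <p> siblings until one starts the next section.
    for sibling in heading_p.find_next_siblings("p"):
        if sibling.find("b") is not None:
            break
        items.append(sibling.get_text(strip=True))

    sections[b.get_text(strip=True)] = items

for title, items in sections.items():
    print(f"{title}: {', '.join(items)}")
```

Stopping at the first sibling that contains a <b> is what keeps x3 and x4 under Innovation instead of running on into Activities.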

Mario Rossi
  • Ideally, it should be so. In practice, however, when people write subtitles and their paragraphs in HTML, there is no specified element hierarchy for the enclosing <p>. – Frank Wang Aug 23 '13 at 18:08
1

I'm pretty new to BeautifulSoup, but this is working for me:

import bs4
from bs4 import BeautifulSoup

htmls = """<p><b>Background</b><br />x0</p><p>x1</p>
           <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
           <p><b>Activities</b><br />x5</p><p>x6</p>"""
html = BeautifulSoup(htmls)

for n in html.find_all('b'):
    title_name = n.next_element
    title_content = n.nextSibling.nextSibling

    results = [title_content]
    for f in n.parent.find_next_siblings():
        el = f.next_element
        if isinstance(el, bs4.element.Tag) and el.name == 'b':
            break
        results.append(el)

    print title_name, results

Results:

Background [u'x0', u'x1']
Innovation [u'x2', u'x3', u'x4']
Activities [u'x5', u'x6']

I chose to use isinstance(el, bs4.element.Tag) and el.name == 'b' as the delimiter because in your example the <p> tags you are trying to capture have no children. This part should probably be a little different depending on the real webpage you are parsing.

mr2ert
  • As a side note -- it's best to avoid `isinstance` whenever possible when working in Python, because one of the benefits of object oriented programming is exactly that these kinds of checks are unnecessary. See, for a full explanation, [this question](http://stackoverflow.com/questions/1549801/differences-between-isinstance-and-type-in-python). – Patrick Collins Aug 26 '13 at 12:51
0

You're stopping after reading one more tag; you need to keep going until you hit the next <b>. nextSibling isn't going to work because the <p>s you're parsing aren't siblings of the <b>s. Try something like this:

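from bs4 import BeautifulSoup, NavigableString


def in_same_section(n):
    # False at the end of the document or when the next element is a <b>.
    nxt = n.next_element
    if nxt is None:
        return False
    try:
        return nxt.name != u'b'
    except AttributeError:
        # Text nodes may not expose .name; they never start a new section.
        return True


htmltext = '''<p><b>Background</b><br />x0</p><p>x1</p>
         <p><b>Innovation</b><br />x2</p><p>x3</p><p>x4</p>
         <p><b>Activities</b><br />x5</p><p>x6</p>'''
html = BeautifulSoup(htmltext)
for n in html.find_all('b'):
    title_name = n.string
    title_content = []
    n = n.next_element  # step onto the title text so it is not collected
    while in_same_section(n):
        n = n.next_element
        # Collect each non-blank text node. x0 shares its <p> with <b> and <br>,
        # so that <p>'s .string is None and cannot be used here.
        if isinstance(n, NavigableString) and n.strip():
            title_content.append(n.strip())
    print('%s: %s' % (title_name, ', '.join(title_content)))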

EDIT: Fixed the AttributeError, I think? I'm at work and can't test this code.

Patrick Collins
  • Fails at `while n.next_element.name != u'b':` with `AttributeError: 'NavigableString' object has no attribute 'name'` for me. – mr2ert Aug 23 '13 at 18:40