
How do you use Beautiful Soup to pull out list items that have a certain class attribute, or that don't have a class attribute at all?

For example, from the HTML below, I'd like to pull out only the list items that have the class attribute "lev1" (i.e. children). I'd also like to pull out the list items that don't have a class attribute (i.e. parents), but I'd like to do these two things separately: first pull out only the list items with the class attribute "lev1", and then pull out only the list items with no class attribute.

<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>

My end goal is to produce something like this:

[[HeaderName1,Parent1,Child1],[HeaderName1,Parent1,Child2],[HeaderName1,Parent1,Child3],[HeaderName2,Parent2,Child1],[HeaderName2,Parent2,Child2],[HeaderName2,Parent2,Child3]]

So far all I have is this:

soup.h3.findNext('ul').contents

This pulls this out:

<li>Parent</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
<li class="lev1">Child3</li>
<li>Parent2</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
<li class="lev1">Child3</li>

Then I apply this, but it gives me both the child and parent items together, when I want to pull them out separately:

[x.text for x in duns_brands_html]
Chris

1 Answer

for h3 in soup.find_all('h3'):
    ul = h3.find_next_sibling('ul')
    lis = ul.find_all('li')
    parent = lis[0].text  # the first <li> has no class: the parent
    for child in lis[1:]:  # the remaining <li class="lev1"> items
        print([h3.text, parent, child.text])

output:

['HeaderName1', 'Parent', 'Child1']
['HeaderName1', 'Parent', 'Child2']
['HeaderName1', 'Parent', 'Child3']
['HeaderName2', 'Parent2', 'Child1']
['HeaderName2', 'Parent2', 'Child2']
['HeaderName2', 'Parent2', 'Child3']
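
If you would rather not rely on the parent always being the first item, you can filter on the class attribute itself, which is what the question asks for. The following is only a sketch: it assumes the markup from the question with the h3 tags closed properly, uses the built-in "html.parser", and the names html, rows, parents and children are made up for illustration. Items with the class are matched with class_="lev1", and items without any class attribute are matched with a small function that checks has_attr("class").

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the <h3> tags closed (assumption).
html = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def li_without_class(tag):
    """Match <li> tags that carry no class attribute at all."""
    return tag.name == "li" and not tag.has_attr("class")


rows = []
for h3 in soup.find_all("h3"):
    ul = h3.find_next_sibling("ul")
    # Pulled separately: first the unclassed items, then the "lev1" items.
    parents = ul.find_all(li_without_class)
    children = ul.find_all("li", class_="lev1")
    for parent in parents:
        for child in children:
            rows.append([h3.text, parent.text, child.text])

for row in rows:
    print(row)
```

Because parents and children are collected by two independent queries, either list can also be used on its own, for example [li.text for li in soup.find_all("li", class_="lev1")] to get just the children across the whole document.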
Guy Gavriely