13

Possible Duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too

I'm using BeautifulSoup to find tables in some HTML. The problem I'm running into is spaces in the class attribute. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract it with the following (I want to be able to find tables whose class is either wikitable or wikitable sortable):

BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable( sortable)?")})

This will find the table if my HTML is just <html><table class="wikitable">blah</table></html> though. Likewise, I have tried using "wikitable sortable" in my regular expression, and that won't match either. Any ideas?

cryptic_star

2 Answers

24

The pattern match will also fail if wikitable appears after another CSS class, as in class="something wikitable other", so if you want all tables whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities:

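import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''

tree = BeautifulSoup(html)
# \b anchors wikitable at word boundaries, wherever it sits in the attribute
for node in tree.findAll(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print(node)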

Result:

<table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></blah></table>

Just for the record, I don't use BeautifulSoup, and prefer to use lxml, as others have mentioned.
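One caveat about the \b pattern above: a hyphen counts as a word boundary, so a class such as wikitable-old would also match. Here is a minimal sketch of a stricter check, using only the standard library and treating the class attribute as whitespace-separated tokens. The names has_class and WIKITABLE are made up for illustration and are not part of BeautifulSoup.

```python
import re


def has_class(class_attr, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    return name in class_attr.split()


# Hypothetical stricter pattern: only whitespace or the string edges
# may delimit the class name, so "wikitable-old" is rejected.
WIKITABLE = re.compile(r"(?:^|\s)wikitable(?:\s|$)")

samples = [
    "wikitable",
    "wikitable sortable",
    "sortable wikitable other",
    "wikitable-old",
    "mywikitable",
]
for value in samples:
    # Columns: class attribute, token check, strict regex check
    print(value, has_class(value, "wikitable"), bool(WIKITABLE.search(value)))
```

The compiled pattern can be passed to findAll in place of the \b version if hyphenated look-alikes are a concern; whether that matters depends on the class names actually present in your pages.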

samplebias
  • Just as an update, the latest version of BeautifulSoup (bs4) handles this much more elegantly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class – Eli Jul 22 '13 at 20:50
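For anyone on bs4, a rough sketch of what that looks like, assuming the bs4 package is installed and using the built-in html.parser backend. The sample HTML and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """<html><table class="sortable wikitable other">a</table>
<table class="wikitable sortable">b</table>
<table class="wikitable">c</table>
<table class="plain">d</table></html>"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches against each individual class token, so position
# and neighbouring classes do not matter.
by_keyword = soup.find_all("table", class_="wikitable")

# The same query expressed as a CSS selector.
by_selector = soup.select("table.wikitable")

for table in by_keyword:
    print(table["class"])  # bs4 exposes class as a list of tokens
```

Both calls should return the first three tables and skip the one with class="plain", with no regular expression needed.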
8

One of the things that makes lxml better than BeautifulSoup is its support for proper CSS-like class selection (it even supports full CSS selectors if you want to use them):

import lxml.html

html = """<html>
<body>
<div class="bread butter"></div>
<div class="bread"></div>
</body>
</html>"""

tree = lxml.html.fromstring(html)

elements = tree.find_class("bread")

for element in elements:
    # encoding="unicode" makes tostring return text rather than bytes
    print(lxml.html.tostring(element, encoding="unicode"))

Gives:

<div class="bread butter"></div>
<div class="bread"></div>
Acorn
  • +1 Even though this doesn't help @allie write BeautifulSoup code, lxml is far superior. – Henry May 04 '11 at 23:00
  • While I appreciate the elegance of that, BeautifulSoup is what is already here, and for the time being, that's what I need to use. :) – cryptic_star May 04 '11 at 23:21
  • The reason so many people prefer BS for HTML and lxml for XML is that it (BS) is far more tolerant of broken HTML. lxml doesn't handle broken HTML well. – SkyLeach Jul 18 '18 at 23:51