40

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?

endolith
  • 3
    **Update**: This has reportedly been fixed in 4 beta 5: https://bugs.launchpad.net/beautifulsoup/+bug/410304 – endolith Feb 16 '12 at 14:43

4 Answers

34

Unfortunately, BeautifulSoup treats the attribute value as a single class name with a space in it, 'class1 class2', rather than as two classes, ['class1', 'class2']. A workaround is to search for the class with a regular expression instead of a string.

This works:

import re

soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
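
One caveat with \b is that it also counts a hyphen as a word boundary, so a class such as 'class1-wide' matches as well. The sketch below compares the pattern above with a stricter, whitespace-delimited one. The stricter pattern and the sample class values are my own illustration, not part of the original workaround.

```python
import re

# Pattern from the answer: \b treats '-' as a word boundary too.
loose = re.compile(r'\bclass1\b')

# Hypothetical stricter pattern: the name must be delimited by whitespace
# or by the start/end of the attribute value.
strict = re.compile(r'(?<!\S)class1(?!\S)')

# Sample class attribute values (made up for illustration).
values = ['class1', 'class1 class2', 'class2 class1', 'class1-wide', 'myclass1']

loose_hits = [v for v in values if loose.search(v)]
strict_hits = [v for v in values if strict.search(v)]

print(loose_hits)   # includes 'class1-wide'
print(strict_hits)  # excludes 'class1-wide'
```

If hyphenated class names appear on the page, the stricter pattern should be usable in the same findAll call in place of the \b version.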
endolith
19

In case anybody comes across this question: BeautifulSoup now supports this.

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

In [1]: import bs4

In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')

In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]

Also, you don't have to type findAll anymore: calling the soup object directly, as in the session above, does the same search.
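
For a script rather than an interactive session, a minimal sketch of the same idea might look like this. The sample markup is made up, and I pass 'html.parser' explicitly as an assumption so the result does not depend on which parser happens to be installed.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = '<div class="foo bar"></div><p class="bar"></p><span class="foo"></span>'

# Naming the parser explicitly keeps the behaviour the same across machines.
soup = BeautifulSoup(html, 'html.parser')

# Calling the soup object is shorthand for find_all().
by_call = soup(attrs={'class': 'bar'})
by_name = soup.find_all(attrs={'class': 'bar'})

# bs4 stores a multi-valued class attribute as a list of names.
names = [tag.name for tag in by_call]
print(names)
print(soup.div['class'])
```

Both calls return the div and the p, since each of them has 'bar' among its classes.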

Kugel
11

You should use lxml. It works with multiple class values separated by spaces ('class1 class2').

Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
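
As a rough sketch of what matching by class looks like in lxml (the sample markup is made up): lxml.html elements have a find_class() method, and the same match can be written as an XPath expression. The cssselect() method gives you real CSS selectors, but as far as I know it needs the separate cssselect package installed, so it is left out here.

```python
import lxml.html

# Made-up sample markup for illustration.
doc = lxml.html.fromstring(
    '<div><p class="class1 class2">both</p>'
    '<p class="class1">one</p><p class="class3">other</p></div>'
)

# find_class() matches one class name inside a space-separated class attribute.
by_class = [el.text for el in doc.find_class('class1')]

# The same match written as an explicit XPath expression.
xpath = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' class1 ')]"
by_xpath = [el.text for el in doc.xpath(xpath)]

print(by_class)
print(by_xpath)
```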

Inaimathi
aehlke
  • 7
    From lxml's own documentation: "While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superiour support for encoding detection." – endolith Aug 10 '09 at 18:41
  • Glad you like it. Hope you'll spread the word too, lxml is an under-appreciated library. I think many overlook it since it has 'XML' in the name and its documentation isn't as nice as BeautifulSoup's. BS has a charm to it with the name and graphics, which makes it a little more attractive for superficial reasons. – aehlke Aug 12 '09 at 20:12
  • Yes, it isn't marketed as a scraper and I don't see enough examples of this kind of stuff in the docs. – endolith Aug 15 '09 at 18:19
  • The first link up top was 404ing, so I changed it to the lxml home page. Hopefully this is what was intended. – Inaimathi Mar 28 '13 at 17:09
  • 1
    Beautiful Soup v4 now supports the use of [different parsers](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser), including lxml. – Sean May 15 '15 at 21:44
  • The question is about BeautifulSoup, not about suggesting YAL (yet another library). :-/ – james-see Jun 03 '15 at 01:16
2

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_, like this:

soup.find_all("a", class_="class1")
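
A slightly fuller sketch, with made-up markup and 'html.parser' chosen as an assumption, shows that class_ matches a tag as long as the name is one of its classes. If you need tags that carry several classes at once, a CSS selector via select() is one way to ask for that.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
html = ('<a class="class1" href="#one">one</a>'
        '<a class="class1 class2" href="#two">two</a>'
        '<a class="class2" href="#three">three</a>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches any tag whose class list contains the given name.
with_class1 = [a.get_text() for a in soup.find_all('a', class_='class1')]

# A CSS selector can require several classes at once, in any order.
with_both = [a.get_text() for a in soup.select('a.class2.class1')]

print(with_class1)
print(with_both)
```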
AbcAeffchen
alan_wang
  • Sorry, but I believe your answer is wrong. According to the Beautiful Soup doc (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Searching%20by%20CSS%20class) there are two options to use `find_all` to find a tag with a CSS class: pass the class name as a String or create a dict with the key "class" and a value with the name of the CSS class. – Rodrigo Taboada Feb 03 '15 at 04:32
  • Thanks for looking at my answer, but I am using bs4, not bs3; maybe the interface has changed. @RodrigoTaboada – alan_wang Feb 04 '15 at 08:13
  • See the Beautiful Soup 4 doc: [link](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class) – alan_wang Feb 04 '15 at 08:30
  • Ok, sorry about that. The first item when I searched for find_all was for the bs3 docs and I didn't realize that. – Rodrigo Taboada Feb 04 '15 at 12:46