
I've got html that contains entries like this:

<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>

and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).

In jQuery, I would do

$('.entry a[rel=bookmark]').text()

The closest I've been able to get in Python is:

from BeautifulSoup import BeautifulSoup
import soupselect as soup

rawsoup = BeautifulSoup(open('fname.html').read())

for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()

soupselect is from http://code.google.com/p/soupselect/.

However, soupselect doesn't understand the full CSS3 selector syntax the way jQuery does. Is there such a beast in Python?

BoltClock
thebjorn

4 Answers


You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many folks now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.

As far as I know, lxml's CSSSelector translates the selector to XPath for element selection, but you might want to check the documentation for yourself. Here's your example with lxml:

>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']
Haes
  • That didn't work for me for some reason (fromstring seems to want valid html *lol*), but one of the links you gave led me towards pyquery. The motivation for pyquery was "Hey let's make jquery in python", and from my preliminary testing I've been able to rely on my knowledge of jQuery instead of reading the docs(!) – thebjorn Dec 13 '10 at 12:59
  • Use "from lxml.html import fromstring" for malformed html – Saurav Dec 02 '11 at 03:24

You might also want to have a look at pyquery, a jQuery-like library for Python.

Aman Aggarwal

It's really very easy using keyword arguments.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<div class="entry">
...   <h3 class="foo">
...     <a href="http://www.example.com/blog-entry-slug"
...     rel="bookmark">Blog Entry</a>
...   </h3>
...   ...
... </div>
... ''')
>>> soup.find('div', 'entry').find(rel='bookmark').text
u'Blog Entry'

Alternately,

>>> for entry in soup('div', 'entry'):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

You can also use attrs to effect a selector of .entry rather than div.entry:

>>> for entry in soup(attrs={'class': 'entry'}):
...     for bookmark in entry(rel='bookmark'):
...         print bookmark.text
...
Blog Entry

(Note that calling the soup, or part of the soup, is equivalent to calling .findAll() on it.)

As a list comprehension, that's [b.text for e in soup('div', 'entry') for b in e(rel='bookmark')] (produces [u'Blog Entry']).

If you want real CSS3 selectors, I'm not aware of any such thing for BeautifulSoup. All of it (or if not quite all, almost all) can be done with simple nesting, conditions and regular expressions (you could just as well use entry(rel=re.compile('^bookmark$'))). If you do want something like that, consider it your next project! It could be useful for flattening code and making it more understandable to web people.

Chris Morgan
  • That doesn't look too bad. The problem I'm having with BeautifulSoup is that I have to re-learn the interface every time I use it. I use jQuery much more frequently, which is why I'm looking for something similar. – thebjorn Dec 13 '10 at 12:52
  • It's really pretty simple. Mostly all you'll want is `findAll(tag_name, class_name, attr1=value)`, etc., with the values being `None` for not set, `True` for set, a `str` for a value or a regular expression from `re.compile`. Then just use normal Python iteration structures. It is different from CSS selectors, but it's not hard to understand and remember and offers more power in some situations. – Chris Morgan Dec 13 '10 at 12:59

BeautifulSoup allows (basic) CSS selectors: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

However, the docs point to lxml (http://lxml.de/) if you need more elaborate CSS selectors.
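
As a minimal sketch, assuming the bs4 package (BeautifulSoup 4) is installed, the selector from the question can be passed straight to `select()`. The sample markup and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, trimmed to the relevant part.
html = (
    '<div class="entry">'
    '<h3 class="foo">'
    '<a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a>'
    '</h3>'
    '</div>'
)

# "html.parser" is the parser from the standard library, so nothing extra is needed.
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of matching tags.
links = soup.select(".entry a[rel=bookmark]")

titles = [a.get_text(strip=True) for a in links]
hrefs = [a["href"] for a in links]
print(titles, hrefs)
```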

corpaul