2

I am trying to use BeautifulSoup to get people's birthdays from Wikipedia. For example, the birthday on http://en.wikipedia.org/wiki/Ezra_Taft_Benson is August 4, 1899. To get the birthday, I am using the following code:

bday = url.find("span", class_="bday")

However, this picks up an instance where bday is just one of several classes on the tag, i.e. <span class="bday dtstart published updated">1985-11-10 </span>.

Is there a way to match only tags whose class is exactly bday?

I hope the question is clear. Currently I am getting 1985-11-10 as the birthday, which is not the correct date.

Pierre GM
user1496289

3 Answers

4

When all other matching methods of BeautifulSoup fail, you can use a function taking a single argument (tag):

>>> url.find(lambda tag: tag.name == 'span' and tag.get('class', []) == ['bday'])
<span class="bday">1899-08-04</span>

The above searches for a span tag whose class attribute is a list of a single element ('bday').
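As a minimal sketch of the difference, the snippet below runs both lookups against a small, made-up HTML sample (not the real Wikipedia page) that mirrors the two spans from the question. It assumes BeautifulSoup 4 (`bs4`) with the built-in `html.parser`; the names `SAMPLE_HTML` and `has_only_bday_class` are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mirroring the two spans described in the question.
SAMPLE_HTML = """
<div>
  <span class="bday dtstart published updated">1985-11-10</span>
  <span class="bday">1899-08-04</span>
</div>
"""


def has_only_bday_class(tag):
    """Return True for <span> tags whose class list is exactly ['bday']."""
    return tag.name == "span" and tag.get("class", []) == ["bday"]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_="bday" matches any span that has bday among its classes,
# so the first hit is the multi-class span.
loose_text = soup.find("span", class_="bday").get_text(strip=True)

# The function filter only accepts spans whose class list is exactly ['bday'].
exact_text = soup.find(has_only_bday_class).get_text(strip=True)

print(loose_text)
print(exact_text)
```

The named function here does the same job as the `lambda` above; which form to use is mostly a matter of taste.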

efotinis
  • this was a great simple solution! thanks. what is the lambda tag doing? – user1496289 Sep 23 '12 at 18:34
  • The `lambda` creates an anonymous function with a single argument (tag). You could define a separate, named function and pass its name to `find()` instead, but for short, one-off functions `lambda` is [preferable](http://stackoverflow.com/a/890188/12320). – efotinis Sep 23 '12 at 19:27
1

I would have gone about it this way (note that this uses Python 2's urllib and BeautifulSoup 3):

import urllib
from BeautifulSoup import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/Ezra_Taft_Benson'
file_pointer = urllib.urlopen(url)
html_object = BeautifulSoup(file_pointer)

bday = html_object('span',{'class':'bday'})[0].contents[0] 

This returns 1899-08-04 as the value of bday.

That1Guy
0

Try using lxml with the BeautifulSoup parser. The following finds <span> tags whose class attribute is exactly bday (on this page there is only one):

>>> from lxml.html.soupparser import fromstring
>>> root = fromstring(open('Ezra_Taft_Benson'))
>>> span_bday_nodes = root.findall('.//span[@class="bday"]')
>>> span_bday_nodes
[<Element span at 0x1be9290>]
>>> span_bday_nodes[0].text
'1899-08-04'
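The path expression works because `[@class="bday"]` compares the whole attribute string, so a value such as "bday dtstart published updated" does not match. A rough sketch of that behaviour follows. As an assumption made to keep it dependency-free, it uses the standard library's `xml.etree.ElementTree` in place of lxml, and a small made-up, well-formed sample (`SAMPLE`) in place of the saved Wikipedia page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the saved page.
SAMPLE = """<div>
  <span class="bday dtstart published updated">1985-11-10</span>
  <span class="bday">1899-08-04</span>
</div>"""

root = ET.fromstring(SAMPLE)

# The predicate compares the full attribute value, so only class="bday" matches.
exact_nodes = root.findall('.//span[@class="bday"]')
exact_dates = [node.text for node in exact_nodes]

print(exact_dates)
```

ElementTree only accepts well-formed XML, so for real-world HTML the lxml approach above is the more practical choice.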
Pedro Romano