Using Python and BS4 I'm trying to scrape birthdates out of Wikipedia articles. They are written as

<span class="bday">1999-12-31</span>

So I used

page.find("span",{"class":"bday"})

to find all birthdates from a list of Wikipedia articles. But unfortunately that also returned the following tag:

<span class="bday dtstart published updated">1999</span>

Now I'm confused. I would expect "bday" to return only tags whose class is exactly that. If I wanted to match everything containing 'bday', I would use something like "bday*". How can I avoid that, and how can I tell BeautifulSoup to return only results that exactly match "bday"?
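
A minimal reproduction of the behaviour, as a sketch: the HTML string is made up from the two tags quoted above (it is not a real Wikipedia page), and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Sample markup built from the two tags quoted in the question (assumption: not real page content).
html = """
<span class="bday dtstart published updated">1999</span>
<span class="bday">1999-12-31</span>
"""

page = BeautifulSoup(html, "html.parser")

# find() returns the first <span> whose class list contains "bday".
first = page.find("span", {"class": "bday"})
print(first["class"])     # the class attribute is stored as a list of names
print(first.get_text())

# find_all() returns every <span> whose class list contains "bday".
all_matches = page.find_all("span", {"class": "bday"})
print(len(all_matches))
```

Both spans are returned because each one carries `bday` somewhere in its class list.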

  • That's not how the `class` selector works; the `class` attribute is a list of class names, and BeautifulSoup, like CSS, gives you all elements where the `class` list contains the class name you are searching for. – Martijn Pieters Apr 27 '20 at 16:48
  • In general, I'd personally try to find a more specific CSS query for the element; perhaps there is a parent element with a class or id that'll let you narrow down the search, and use `soup.select_one()` with such a CSS query rather than `soup.find()`. E.g. a Wikipedia `bday` span is usually found in the `biography` table, so `soup.select_one('.biography span.bday')` is probably going to be much more useful. – Martijn Pieters Apr 27 '20 at 16:55
  • And finally, consider parsing the raw MediaWiki source instead of the HTML output for Wikipedia pages; see [this example](https://stackoverflow.com/questions/29725163/scraping-part-of-a-wikipedia-infobox/29725192#29725192); look for the `| birth_date` line. – Martijn Pieters Apr 27 '20 at 17:02
  • Thank you for your fast answers, Martijn. On the last point: I might do that as well, but this project is also meant as an exercise so I can learn to scrape from HTML code. Thanks for the tip, though. Your second suggestion, I'm afraid, might not work because the "wrong" bday might be in the same table structure. But I will look closer into that. – Nikolai Pardon Apr 27 '20 at 17:47
  • it's basically the [hCard microformat](https://en.wikipedia.org/wiki/HCard), so `soup.select_one('.vcard .bday')` *should* be sufficient. – Martijn Pieters Apr 27 '20 at 17:55
  • Nope, not working - same result. – Nikolai Pardon Apr 27 '20 at 19:00
  • But I found a solution here. Someone else already had a very similar problem: [link](https://stackoverflow.com/questions/22726860/beautifulsoup-webscraping-find-all-finding-exact-match) – Nikolai Pardon Apr 27 '20 at 19:02
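
One way to get an exact match, sketched under stated assumptions: since BeautifulSoup stores `class` as a list, a function filter can compare that list against `["bday"]` directly. The sample HTML and the helper name `has_only_bday_class` are made up for illustration, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Sample markup built from the two tags quoted in the question (assumption: not real page content).
html = """
<span class="bday dtstart published updated">1999</span>
<span class="bday">1999-12-31</span>
"""

soup = BeautifulSoup(html, "html.parser")


def has_only_bday_class(tag):
    """Hypothetical helper: True for <span> tags whose class list is exactly ["bday"]."""
    return tag.name == "span" and tag.get("class") == ["bday"]


# find_all() accepts a function and keeps the tags for which it returns True.
exact = soup.find_all(has_only_bday_class)
birthdates = [tag.get_text(strip=True) for tag in exact]
print(birthdates)
```

The span with the extra `dtstart published updated` classes is skipped because its class list is not equal to `["bday"]`.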