Beautiful Soup: Accessing <li> elements from <ul> with no id

Question

I am trying to scrape the people who have birthdays listed on this Wikipedia page (the page for January 1).

Here is the existing code:

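import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup  # assumed import; the original post omitted it

hdr = {'User-Agent': 'Mozilla/5.0'}
site = "http://en.wikipedia.org/wiki/" + "january" + "_" + "1"
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

print soup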

This all works fine and I get the entire HTML page, but I want specific data, and I don't know how to access it with Beautiful Soup without an id to use. The <ul> tag does not have an id, and neither do the <li> tags. I also can't just ask for every <li> tag, because there are other lists on the page. Is there a specific way to select a given list? (I can't use a fix that works only for this one page: I plan on iterating through all the dates and getting every page's birthdays, and I can't guarantee that every page has exactly the same layout as this one.)

You need *some* kind of reference, whether it's positional, id, class, etc. Do you know, for example, which of the lists on that page it is by position? Is that consistent? — Colleen, Jul 16 '13 at 17:46
From the birth section on, unambiguously identifiable via `<span id="Births">`, each `<li>` corresponds to one person (until the next heading). — Dr. Jan-Philip Gehrcke, Jul 16 '13 at 17:48
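
The "until the next heading" idea from the comment above can be sketched as follows. This is only an illustration under stated assumptions: `SAMPLE_HTML` is a hypothetical, heavily simplified stand-in for the page markup, and `items_until_next_heading` is a made-up helper name, not part of Beautiful Soup. It walks the siblings that follow the section heading and stops at the next heading of the same level, so it also copes with a section split across several lists.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, simplified markup; the real page is larger and may differ.
SAMPLE_HTML = """
<h2><span id="Events">Events</span></h2>
<ul><li>45 BC – An event</li></ul>
<h2><span id="Births">Births</span></h2>
<ul><li>1431 – Pope Alexander VI (d. 1503)</li></ul>
<ul><li>1449 – Lorenzo de' Medici, Italian politician (d. 1492)</li></ul>
<h2><span id="Deaths">Deaths</span></h2>
<ul><li>1515 – Louis XII of France (b. 1462)</li></ul>
"""


def items_until_next_heading(soup, section_id):
    """Return the text of every <li> between a section heading and the next one."""
    span = soup.find("span", id=section_id)
    if span is None:
        return []  # the page has no such section
    heading = span.parent
    items = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace text nodes between tags
        if sibling.name == heading.name:
            break  # the next heading of the same level ends the section
        items.extend(li.get_text().strip() for li in sibling.find_all("li"))
    return items


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
births = items_until_next_heading(soup, "Births")
for entry in births:
    print(entry)
```

Stopping at the next heading keeps entries from the following section out of the result, at the cost of assuming that headings and lists are siblings in the document tree.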

alecxe · Answer 1 · 2013-07-16T18:00:35.623

The idea is to get the span with the id Births, find its parent's next sibling (which is the ul), and iterate over its li elements. Here's a complete example using requests (which HTTP library you use is not relevant, though):

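from bs4 import BeautifulSoup as Soup

import requests


response = requests.get("http://en.wikipedia.org/wiki/January_1")
# Name the parser explicitly so the result does not depend on what is installed.
soup = Soup(response.content, "html.parser")

# The "Births" heading contains a span with id="Births".
births_span = soup.find("span", {"id": "Births"})
# The heading's next sibling tag is the <ul> that holds the entries.
births_ul = births_span.parent.find_next_sibling()

# find_all returns only Tag objects, so no isinstance check is needed.
for item in births_ul.find_all('li'):
    print(item.text)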

prints:

871 – Zwentibold, Frankish son of Arnulf of Carinthia (d. 900)
1431 – Pope Alexander VI (d. 1503)
1449 – Lorenzo de' Medici, Italian politician (d. 1492)
1467 – Sigismund I the Old, Polish king (d. 1548)
1484 – Huldrych Zwingli, Swiss pastor and theologian (d. 1531)
1511 – Henry, Duke of Cornwall (d. 1511)
1516 – Margaret Leijonhufvud, Swedish wife of Gustav I of Sweden (d. 1551)
...

Hope that helps.
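
To try the span, parent, next-sibling navigation without fetching the page, here is a minimal offline sketch. The `SAMPLE_HTML` string is a hypothetical, simplified stand-in for the real markup, and passing `"ul"` to `find_next_sibling` is a small addition to the code above so that any non-list sibling in between is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (an assumption).
SAMPLE_HTML = (
    '<h2><span id="Births">Births</span></h2>'
    "<ul>"
    "<li>1431 – Pope Alexander VI (d. 1503)</li>"
    "<li>1449 – Lorenzo de' Medici, Italian politician (d. 1492)</li>"
    "</ul>"
    '<h2><span id="Deaths">Deaths</span></h2>'
    "<ul><li>1515 – Louis XII of France (b. 1462)</li></ul>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: locate the span that marks the section.
births_span = soup.find("span", {"id": "Births"})
# Step 2: go up to the heading, then across to the next <ul> sibling.
births_ul = births_span.parent.find_next_sibling("ul")
# Step 3: read the text of each list item.
birth_lines = [li.get_text() for li in births_ul.find_all("li")]

for line in birth_lines:
    print(line)
```

Only the two entries under the Births heading are collected; the list under Deaths is a sibling of a different heading and is never visited.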

This recipe works for most summary pages on Wikipedia; just change the id value. Thanks — MortenB, Sep 02 '17 at 15:43

score 6 · Accepted Answer · answered Jul 16 '13 at 17:46

Find the Births section:

section = soup.find('span', id='Births').parent

And then find the next unordered list:

births = section.find_next('ul').find_all('li')
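
Since the question is about looping over many date pages whose layout may vary, the two steps above could be wrapped in a function that returns an empty list when the section is missing. This is a sketch, not a definitive implementation: `date_page_url`, `extract_births` and `fetch_births` are hypothetical helper names, the URL pattern and the User-Agent header are copied from the question, and it assumes the page exposes a span with id Births, which may not hold for every page or for current markup.

```python
from bs4 import BeautifulSoup


def date_page_url(month_name, day):
    """Build a date-page URL in the same form as the question (hypothetical helper)."""
    return "http://en.wikipedia.org/wiki/{}_{}".format(month_name, day)


def extract_births(html):
    """Return the text of each <li> in the first <ul> after the Births heading."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id="Births")
    if span is None:
        return []  # layout differs: no span with id="Births"
    births_list = span.parent.find_next("ul")
    if births_list is None:
        return []  # heading found, but no list follows it
    return [li.get_text().strip() for li in births_list.find_all("li")]


def fetch_births(month_name, day):
    """Download one date page and extract its births (needs network access)."""
    import requests  # imported here so the parsing code works without it

    headers = {"User-Agent": "Mozilla/5.0"}  # same header as in the question
    response = requests.get(date_page_url(month_name, day), headers=headers)
    return extract_births(response.content)


# Offline check with hypothetical markup: a paragraph sits between heading and list.
sample = (
    '<h2><span id="Births">Births</span></h2>'
    "<p>Introductory text</p>"
    "<ul><li>First person</li><li>Second person</li></ul>"
)
print(extract_births(sample))
print(extract_births("<p>No births section here</p>"))
```

Unlike `find_next_sibling`, `find_next` searches everything that follows in document order, so it still finds the list when other elements sit between the heading and the `<ul>`. The trade-off is that it can also pick up a list from a later section if the Births section has none.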

Blender, answered Jul 16 '13 at 17:46
