I'm trying to write a Python program that can search Wikipedia for people's birth and death dates.

For example, Albert Einstein was born: 14 March 1879; died: 18 April 1955.

I started with Fetch a Wikipedia article with Python

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml')
page2 = infile.read()

This works as far as it goes. page2 is the XML representation of the requested section of Albert Einstein's Wikipedia page.

I also looked at this tutorial on parsing XML with Python: http://www.travisglines.com/web-coding/python-xml-parser-tutorial. But now that I have the page in XML format, I don't understand how to get the information I want (birth and death dates) out of the XML. I feel like I must be close, and yet I have no idea how to proceed from here.

EDIT

After a few responses, I've installed BeautifulSoup. I'm now at the stage where I can print:

import BeautifulSoup as BS
soup = BS.BeautifulSoup(page2)
print soup.getText()
{{Infobox scientist
| name        = Albert Einstein
| image       = Einstein 1921 portrait2.jpg
| caption     = Albert Einstein in 1921
| birth_date  = {{Birth date|df=yes|1879|3|14}}
| birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
| death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
| death_place = [[Princeton, New Jersey|Princeton]], New Jersey, United States
| spouse      = [[Mileva Marić]] (1903–1919)<br>{{nowrap|[[Elsa Löwenthal]] (1919–1936)}}
| residence   = Germany, Italy, Switzerland, Austria, Belgium, United Kingdom, United States
| citizenship = {{Plainlist|
* [[Kingdom of Württemberg|Württemberg/Germany]] (1879–1896)
* [[Statelessness|Stateless]] (1896–1901)
* [[Switzerland]] (1901–1955)
* [[Austria–Hungary|Austria]] (1911–1912)
* [[German Empire|Germany]] (1914–1933)
* United States (1940–1955)
}}

So, much closer, but I still don't know how to pull the death_date out of this format. Unless I start parsing it with re? I can do that, but I feel like I'd be using the wrong tool for this job.

JBWhitmore
  • An XML parser won't help you further. Read what JBernardo says: fetch data in json format and use a dedicated MW parser. – georg Sep 03 '12 at 15:53
  • I have attached complete code both with/without using `re` to parse it. – K Z Sep 03 '12 at 18:19
  • Please, don't try to impersonate a browser by your User-Agent. According to [the Wikimedia User-Agent policy](http://meta.wikimedia.org/wiki/User-Agent_policy), you should use “an informative User-Agent string with contact information”. – svick Sep 03 '12 at 22:14

6 Answers


You can consider using a library such as BeautifulSoup or lxml to parse the response HTML/XML.

You may also want to take a look at Requests, which has a much cleaner API for making requests.


Here is working code using Requests, BeautifulSoup and re. It is arguably not the best solution here, but it is quite flexible and can be extended for similar problems:

import re
import requests
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=xml'

res = requests.get(url)
soup = BeautifulSoup(res.text, "xml")

birth_re = re.search(r'(Birth date(.*?)}})', soup.revisions.getText())
birth_data = birth_re.group(0).split('|')
birth_year = birth_data[2]
birth_month = birth_data[3]
birth_day = birth_data[4]

death_re = re.search(r'(Death date(.*?)}})', soup.revisions.getText())
death_data = death_re.group(0).split('|')
death_year = death_data[2]
death_month = death_data[3]
death_day = death_data[4]

Per @JBernardo's suggestion, here is a better answer for this particular use case, using JSON data and mwparserfromhell:

import requests
import mwparserfromhell

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json'

res = requests.get(url)
text = list(res.json()["query"]["pages"].values())[0]["revisions"][0]["*"]
wiki = mwparserfromhell.parse(text)

birth_data = wiki.filter_templates(matches="Birth date")[0]
birth_year = birth_data.get(1).value
birth_month = birth_data.get(2).value
birth_day = birth_data.get(3).value

death_data = wiki.filter_templates(matches="Death date")[0]
death_year = death_data.get(1).value
death_month = death_data.get(2).value
death_day = death_data.get(3).value
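In either variant, the extracted fields come back as strings. As a small usage sketch (with the values from the Einstein example above), the stdlib can turn them into real dates:

```python
from datetime import date

# Template parameters are extracted as strings; convert before use.
birth = date(int("1879"), int("3"), int("14"))
death = date(int("1955"), int("4"), int("18"))
print(death - birth)  # lifespan as a timedelta
```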
K Z
  • Did you even check the data to see if an HTML/XML parser will help? Hint: it will not – JBernardo Sep 03 '12 at 15:46
  • @JBernardo You are right, the contents are in the same XML tag. Though it seems like the JSON format has the same problem. I think one of the parsers you suggested would parse the data inside the tag? – K Z Sep 03 '12 at 16:05
  • @KayZhu so you realize the real data he wants to parse is the Wiki format? The use of JSON is to make it easier to reach the Wiki data (because JSON is much simpler than XML) – JBernardo Sep 03 '12 at 16:06
  • @JBernardo Yes you are right, it seems though the parsers in the link will work well with either. – K Z Sep 03 '12 at 16:11

First: the Wikipedia API allows the use of JSON instead of XML, and that will make things much easier.

Second: there's no need to use HTML/XML parsers at all (the content is not HTML, and the container doesn't need to be either). What you need to parse is the Wiki format inside the "revisions" tag of the JSON.

Check some Wiki parsers here


What seems to be confusing here is that the API allows you to request a certain format (XML or JSON), but that's just a container for some text in the real format you want to parse:

This one: {{Birth date|df=yes|1879|3|14}}

With one of the parsers provided in the link above, you will be able to do that.
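To make the container-vs-content distinction concrete, here is a minimal hand-rolled sketch (stdlib only; a dedicated MW parser remains the better tool, since real templates can nest) that pulls the positional parameters out of such a template:

```python
# Wikitext template as delivered inside the JSON/XML container.
template = "{{Birth date|df=yes|1879|3|14}}"

# Strip the braces and split on '|': the first part is the template name,
# 'key=value' parts are named parameters, the rest are positional.
parts = template.strip("{}").split("|")
positional = [p for p in parts[1:] if "=" not in p]
year, month, day = positional
print(year, month, day)  # 1879 3 14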

JBernardo
  • OK, so I can read it in as JSON: `infile = opener.open('http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=Albert_Einstein&format=json')` Looking at the Wiki parsers you linked to, I see plenty that are XML/HTML but no JSON listed. Do you have a recommended one? – JBWhitmore Sep 03 '12 at 15:55
  • @JBWhitmore the `json` module comes with Python. It is **just** a container for the real data you want to parse. This data is not in XML or HTML or JSON. It is in some specific Wiki format – JBernardo Sep 03 '12 at 16:03
  • @JBWhitmore You want to parse this kind of data: `{{Birth date|df=yes|1879|3|14}}` and one of the modules in the link will help you. – JBernardo Sep 03 '12 at 16:04
  • This is a better answer than mine for this specific case. Have an upvote :) – K Z Sep 03 '12 at 16:08
  • Look, I appreciate that you guys know what's going on. Meanwhile, I have a blinking cursor in my terminal and a variable with Wiki formatted JSON data in it. And yes, I would like to parse exactly `{{Birth date|df=yes|1879|3|13}}` -- but that's what I'm asking you: how do I do that? – JBWhitmore Sep 03 '12 at 16:11
  • @JBWhitmore one of the parsers from the link above: [mwparserfromhell](https://github.com/earwig/mwparserfromhell/). I don't know if it is the best, but read the "Usage" section from this link and try to understand what it does and why it will parse what you want – JBernardo Sep 03 '12 at 16:29

First, use pywikipedia. It allows you to query article text, template parameters, etc. through a high-level abstract interface. Second, I would go with the Persondata template (look towards the end of the article). Also, in the long term you might be interested in Wikidata, which will take several months to introduce, but which will make most metadata in Wikipedia articles easily queryable.

Tgr

I came across this question and appreciated all the useful information that was provided in @Yoshiki's answer, but it took some synthesizing to get to a working solution. Sharing here in case it's useful for anyone else. The code is also in this gist for those who wish to fork / improve it.

In particular, there's not much in the way of error handling here ...

from datetime import datetime
import json

import requests
from dateutil import parser


def id_for_page(page):
    """Uses the wikipedia api to find the wikidata id for a page"""
    api = "https://en.wikipedia.org/w/api.php"
    query = "?action=query&prop=pageprops&titles=%s&format=json"
    slug = page.split('/')[-1]

    response = json.loads(requests.get(api + query % slug).content)
    # Assume we got 1 page result and it is correct.
    page_info = list(response['query']['pages'].values())[0]
    return page_info['pageprops']['wikibase_item']


def lifespan_for_id(wikidata_id):
    """Uses the wikidata API to retrieve wikidata for the given id."""
    data_url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json"
    page = json.loads(requests.get(data_url % wikidata_id).content)

    claims = list(page['entities'].values())[0]['claims']
    # P569 (birth) and P570 (death) ... not everyone has died yet.
    return [get_claim_as_time(claims, cid) for cid in ['P569', 'P570']]


def get_claim_as_time(claims, claim_id):
    """Helper function to work with data returned from wikidata api"""
    try:
        claim = claims[claim_id][0]['mainsnak']['datavalue']
        assert claim['type'] == 'time', "Expecting time data type"

        # dateparser chokes on leading '+', thanks wikidata.
        return parser.parse(claim['value']['time'][1:])
    except KeyError as e:
        print(e)
        return None


def main():
    page = 'https://en.wikipedia.org/wiki/Albert_Einstein'

    # 1. use the wikipedia api to find the wikidata id for this page
    wikidata_id = id_for_page(page)

    # 2. use the wikidata id to get the birth and death dates
    span = lifespan_for_id(wikidata_id)

    for label, dt in zip(["birth", "death"], span):
        if dt is not None:  # living people have no death date
            print(label, " = ", datetime.strftime(dt, "%b %d, %Y"))


if __name__ == '__main__':
    main()
Jason Sundram

The persondata template is deprecated now, and you should access Wikidata instead. See Wikidata:Data access. My earlier (now deprecated) answer from 2012 was as follows:

What you should do is parse the {{persondata}} template found in most biographical articles. There are existing tools for extracting such data programmatically; with your existing knowledge and the other helpful answers, I am sure you can make that work.

  • For what it's worth, in case it saves someone else a click later, Persondata appears to now be deprecated. The link says that it, "…has now been removed. From now on, such data should be added, with a citation, to Wikidata instead." – Matt V. Jul 09 '17 at 07:44

An alternative as of 2019 is to use the Wikidata API, which, among other things, exposes biographical data such as birth and death dates in a structured format that is easy to consume without any custom parsers. Many Wikipedia articles depend on Wikidata for their info, so in many cases this will be the same as consuming the Wikipedia data.

For example, look at the Wikidata page for Albert Einstein and search for "date of birth" and "date of death"; you will find they are the same as in Wikipedia. Every entity in Wikidata has a list of "claims", which are pairs of "properties" and "values". To know when Einstein was born and died, we only need to search the list of statements for the appropriate properties, in this case P569 and P570. To do this programmatically, it's best to access the entity as JSON, which you can do with the following URL structure:

https://www.wikidata.org/wiki/Special:EntityData/Q937.json

And as an example, here is what the claim P569 states about Einstein:

        "P569": [
          {
            "mainsnak": {
              "property": "P569",
              "datavalue": {
                "value": {
                  "time": "+1879-03-14T00:00:00Z",
                  "timezone": 0,
                  "before": 0,
                  "after": 0,
                  "precision": 11,
                  "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                },
                "type": "time"
              },
              "datatype": "time"
            },
            "type": "statement",

You can learn more about accessing Wikidata in this article, and more specifically about how dates are structured in Help:Dates.
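Pulling the date out of such a claim is plain dict indexing plus a timestamp parse. A minimal sketch against a trimmed-down excerpt like the one above (note the leading '+' on Wikidata timestamps has to be stripped before parsing):

```python
from datetime import datetime

# Trimmed-down claim structure, as returned by Special:EntityData.
claims = {
    "P569": [{
        "mainsnak": {
            "property": "P569",
            "datavalue": {
                "value": {"time": "+1879-03-14T00:00:00Z", "precision": 11},
                "type": "time",
            },
        },
    }]
}

raw = claims["P569"][0]["mainsnak"]["datavalue"]["value"]["time"]
born = datetime.strptime(raw.lstrip("+"), "%Y-%m-%dT%H:%M:%SZ")
print(born.date())  # 1879-03-14
```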

Yoshiki
  • I am also looking to extract birth/death dates – I started out via the wikipedia/beautiful soup route like the OP, but found Yoshiki's suggestion to use Wikidata much easier. This article gives practical examples of using Wikidata and was very helpful for me: https://medium.com/freely-sharing-the-sum-of-all-knowledge/writing-a-wikidata-query-discovering-women-writers-from-north-africa-d020634f0f6c – sally2000 Sep 17 '20 at 18:18