What would be the easiest way to get all articles about people from Wikipedia? I know I can download a dump of all the pages, but then how do I filter those and get only the ones about people? I need as many as I can get (preferably more than a million) so using any sort of API is probably not an option.
-
I really don't know what you're asking for, aside from over a million Wikipedia articles about people (which isn't a suitable topic for SO). – David Thornley Oct 25 '10 at 17:38
-
What exactly do you mean? Are you asking for advice on how to implement a web spider? – Adrian Grigore Oct 25 '10 at 18:02
-
No, I don't think spidering is appropriate in this case. It's possible to download a dump file of Wikipedia. The question is how to filter the dump file's XML and get only the pages that are about people. – Johnny Oct 26 '10 at 08:11
3 Answers
Since articles about people usually contain the Persondata template, you can just search for all articles that contain Persondata. You can find a sample API query for doing just that here:
Does the Wikipedia API support searches for a specific template?
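
For illustration, here is a minimal sketch of such a query using the MediaWiki API's `list=embeddedin` module, which lists pages that transclude a given template. The helper names (`embeddedin_url`, `titles_from_response`) are made up for this example, and it assumes the template is still in use on the wiki you query. It only builds the request URL and reads titles out of an already-decoded JSON response; fetching and following `eicontinue` is left to you.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def embeddedin_url(template="Template:Persondata", limit=500, cont=None):
    """Build a query URL listing main-namespace pages that transclude `template`."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "einamespace": 0,   # articles only
        "eilimit": limit,
        "format": "json",
    }
    if cont:
        # Value of continue.eicontinue from the previous response
        params["eicontinue"] = cont
    return API_URL + "?" + urlencode(params)

def titles_from_response(data):
    """Extract page titles from a decoded JSON response (a dict)."""
    return [page["title"] for page in data.get("query", {}).get("embeddedin", [])]
```

For more than a million pages this means thousands of paged requests, which is why the dump-based approaches may suit the question better.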

As of 2014 you have another option: query Wikidata for all entities where the property instance of (P31) has the value human (Q5).
Full list of humans: https://www.wikidata.org/wiki/Special:WhatLinksHere/Q5
From that list, filter out anything that doesn't have a sex or gender (P21), to get rid of pages like “scientist”.
This way, you don't need to keep track of what templates are used for people in each and every different language edition (there are 285) of Wikipedia.
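
If you work from a Wikidata JSON dump rather than the web page, the same two checks can be sketched roughly as follows. This assumes each entity is a decoded JSON object in the standard Wikidata entity format (`claims` → property → list of statements with a `mainsnak`); the function names are hypothetical, chosen for this example.

```python
def claim_item_ids(entity, prop):
    """Return the item IDs (e.g. "Q5") that `entity` has as values for `prop`."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        # "novalue"/"somevalue" snaks have no datavalue, hence the .get() chain
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids

def is_person(entity):
    """True if the entity is an instance of (P31) human (Q5) and has a sex or gender (P21)."""
    has_gender = bool(entity.get("claims", {}).get("P21"))
    return "Q5" in claim_item_ids(entity, "P31") and has_gender
```

Each matching entity's `sitelinks` field then gives the corresponding article title in every language edition that has one.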

If you are going to roll your own, what you need to focus on is the "infobox data" in the XML dump.
Reference: http://code.google.com/p/infobox2rdf/
Alternatively, check out http://www.freebase.com or http://dbpedia.org.
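
As a rough sketch of the roll-your-own route, the following streams a pages-articles XML dump and yields the titles of pages whose wikitext contains a person-related template. The marker list (`Persondata`, `Infobox person`) is an assumption for illustration; real articles use many more specific infoboxes, so you would need to extend it. The function name and the tiny in-memory sample are likewise invented for the example.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed markers; extend with other person infoboxes as needed
PERSON_MARKERS = re.compile(r"\{\{\s*(Persondata|Infobox person)", re.IGNORECASE)

def local_name(tag):
    """Strip the XML namespace, which varies between dump schema versions."""
    return tag.rsplit("}", 1)[-1]

def iter_person_titles(xml_stream):
    """Yield titles of <page> elements whose text matches PERSON_MARKERS."""
    for _event, page in ET.iterparse(xml_stream, events=("end",)):
        if local_name(page.tag) != "page":
            continue
        title, text = None, ""
        for child in page.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        if PERSON_MARKERS.search(text):
            yield title
        page.clear()  # free the page's contents to keep memory use low

# Tiny in-memory stand-in for a real dump file
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Ada Lovelace</title><revision>"
    b"<text>{{Infobox person|name=Ada Lovelace}}</text></revision></page>"
    b"<page><title>Scientist</title><revision>"
    b"<text>A scientist is a person who researches.</text></revision></page>"
    b"</mediawiki>"
)
titles = list(iter_person_titles(io.BytesIO(SAMPLE)))
```

For a real dump, pass an open file object instead, for example one from `bz2.open(path, "rb")` for the compressed download.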
