
I have searched around but haven't gotten much help. Here's my problem: I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. A category page lists some pages and links to subcategories. I will visit some of those pages and extract just the page abstracts, then follow the links from that category page down to the next level, and so on.

I know C++/PHP/JS/Python. Which fits best here? I'd like to do this in a day. I know there's an API, but it doesn't seem helpful for getting content.

  1. I need to get pages
  2. Parse them to reach the categories div (or the corresponding element in the raw wiki data), both to extract the abstracts and to follow links to other pages.

I need suggestions for programming languages, libraries, and public code if available. I have also heard Wikipedia doesn't like bot crawlers; I am planning to fetch maybe 500 docs at most. Is that a problem?

Thanks a lot

Sanjeev Satheesh
  • @Sanjeev Satheesh It can be done with regexes. If it isn't too complicated, it may be done relatively rapidly. I'll go to Wikipedia and your links to see what you want and study the problem – eyquem Mar 08 '11 at 11:26
  • @Sanjeev Satheesh How do we go from the portal Computer_science to its categories page? What is the link on the portal page that leads to its categories page? – eyquem Mar 08 '11 at 11:34
  • Scroll down the portal's page. There is a block called `Categories` with links. – Sanjeev Satheesh Mar 08 '11 at 11:36
  • @Sanjeev Satheesh I had seen the Category block, but I hadn't tried the link under the name "Competitions". OK now – eyquem Mar 08 '11 at 12:30
  • @Sanjeev Satheesh I wrote a little code to fetch the content of the portal's page, but there is something special about Wikipedia: I get _'Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 08 Mar 2011 12:27:13 GMT\n'_ as line 86. It reminds me that I once saw code that fetched Wikipedia pages using a special request to get the right source code, but I don't remember the request that must be sent – eyquem Mar 08 '11 at 12:36
  • If you are using a script, you should identify as some kind of browser, I think. Apparently, Wikipedia does not like random bots: you either need to be registered as a bot or identify yourself as a browser – Sanjeev Satheesh Mar 08 '11 at 12:39
  • You **must not** have your script identify as some kind of browser. See [the Wikimedia User-Agent policy](http://meta.wikimedia.org/wiki/User-Agent_policy). Read-only bots that don't query so much as to be noticed by the sysadmins are ok. – Anomie Mar 08 '11 at 16:12

4 Answers


There isn't necessarily a category corresponding to a portal, although you could try looking for a category with the same name as the portal, the categories the portal page is in (using the API, you can query this with prop=categories), or the category pages linked from the portal page (prop=links&plnamespace=14).
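For illustration, here is a minimal Python sketch of that last approach (`prop=links&plnamespace=14` against the portal page). The crawler name and contact address in the User-Agent are placeholders to fill in per the Wikimedia policy linked in the comments above:

```python
import json
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"
# Placeholder identity -- the User-Agent policy asks bots to name themselves
# and give contact information.
HEADERS = {"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"}

def api_get(params):
    """GET an api.php query and decode the JSON response."""
    url = API + "?" + urllib.parse.urlencode(dict(params, format="json"))
    req = urllib.request.Request(url, headers=HEADERS)
    return json.loads(urllib.request.urlopen(req).read().decode("utf-8"))

# Category pages (namespace 14) linked from the portal page.
data = api_get({
    "action": "query",
    "titles": "Portal:Computer science",
    "prop": "links",
    "plnamespace": 14,
    "pllimit": "max",
})
for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print(link["title"])
```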

Any of those languages would work. You could also pick Perl, Java, C#, Objective-C, or just about any other language. A list of frameworks of varying quality can be found here or here.

The API can certainly give you content, using prop=revisions. You can even query just the "lead" section with rvsection=0. The API can also give you the list of pages in a category with list=categorymembers and the list of categories for a page using prop=categories.
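Reusing the `api_get` helper from the sketch above, those queries look roughly like this; the combined lead-plus-categories form matches the query linked in the comments below:

```python
# Lead section and category list of one page in a single query.
data = api_get({
    "action": "query",
    "titles": "Computer programming",
    "prop": "revisions|categories",
    "rvprop": "content",
    "rvsection": 0,        # section 0 is the lead
    "cllimit": "max",
})
page = next(iter(data["query"]["pages"].values()))
print(page["revisions"][0]["*"][:300])                  # raw wikitext of the lead
print([c["title"] for c in page.get("categories", [])])

# Pages and subcategories inside a category.
data = api_get({
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Computer science",
    "cmlimit": "max",
})
for member in data["query"]["categorymembers"]:
    print(member["ns"], member["title"])                # ns 14 entries are subcategories
```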

500 pages shouldn't be an issue. If you wanted a significant proportion of the articles, you'd want to look into using a database dump instead.

See the API documentation for details.

Anomie
  • +1 Using the API is definitely the way to go. (removed my answer) – Shawn Chin Mar 08 '11 at 12:18
  • Awesome! Thanks. I was finding it hard to pore through the API doc. So I can get the lead section alone too! Cool. Could you post 2 full API calls for me please, one for getting the leads and another for getting the categories? I keep missing one of the params and get an error. – Sanjeev Satheesh Mar 08 '11 at 12:30
  • @Sanjeev Satheesh Using the API might be the better way to get information out of Wikipedia, even for someone who knows how to get the source code of a Wikipedia page. At present I don't know either of the two ways – eyquem Mar 08 '11 at 12:42
  • It's actually possible to get both the lead section and the categories list for a page in [one query](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=revisions|categories&rvprop=content&rvsection=0&cllimit=max). But if you want separate queries, here they are for the [lead](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=revisions&rvprop=content&rvsection=0) and [cats](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=categories&cllimit=max) – Anomie Mar 08 '11 at 16:07
  • Like [this](http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Computer+science&cmnamespace=14&cmlimit=max) then. Do be aware that the category structure is not a tree or even an acyclic graph, and that semantic drift can be a problem. Also, if you are querying large categories (i.e. more than 500 members of any type) be sure to read [the API manual section about continuing queries](http://www.mediawiki.org/wiki/API:Query#Continuing_queries). – Anomie Mar 08 '11 at 16:56

Python. Have fun with scraping the page; for this I would suggest XPath via lxml.html.
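A minimal sketch of that suggestion, assuming the `mw-normal-catlinks` div (the block of category links at the foot of an article) is the target; the User-Agent is a placeholder, since the default urllib one can trigger the ERR_ACCESS_DENIED error mentioned in the question's comments:

```python
import urllib.request
import lxml.html

req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Computer_programming",
    headers={"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"},
)
doc = lxml.html.fromstring(urllib.request.urlopen(req).read())

# Category links live in the mw-normal-catlinks div; verify against the live
# markup, since scraped selectors can go stale.
for a in doc.xpath('//div[@id="mw-normal-catlinks"]//a'):
    print(a.text_content(), a.get("href"))
```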

Jakob Bowyer

Although you are looking for a web-crawler-based solution, let me suggest that you take a look at DBpedia. Essentially, it's Wikipedia in RDF format. You can download entire database dumps, run SPARQL queries against it, or point directly at a resource and start exploring from there by walking the references.
For example, the Computer science category can be accessed at this URL:

http://dbpedia.org/page/Category:Computer_science
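As a rough illustration of the SPARQL route, this sketch asks DBpedia's public endpoint for the subcategories of that category. `skos:broader` is the property DBpedia uses to link a category to its parent, but treat the exact vocabulary as an assumption to verify against the current dataset:

```python
import json
import urllib.parse
import urllib.request

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?sub WHERE {
  ?sub skos:broader <http://dbpedia.org/resource/Category:Computer_science> .
}
"""
# Virtuoso accepts a MIME type in the format parameter to select JSON results.
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)
data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
for row in data["results"]["bindings"]:
    print(row["sub"]["value"])   # URIs of subcategories
```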

Zsolt Török

I suggest Python for fast development. You need two modules: one to crawl all the possible categories inside a category (basically a category tree), and another to extract info from the detail pages (i.e. normal wiki pages). Wikipedia supports Special:Export in the URL, which returns an XML response, and an XPath-capable XML library for Python will help you parse it. A sketch follows.
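Here is a minimal sketch of the Special:Export approach with the standard library's ElementTree (Python has no built-in "xpath" module; lxml provides full XPath if you need it). The page title and User-Agent are placeholders:

```python
import urllib.request
import xml.etree.ElementTree as ET

req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Special:Export/Computer_programming",
    headers={"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"},
)
tree = ET.parse(urllib.request.urlopen(req))

# The export XML is namespaced and the version suffix changes over time,
# so read the namespace off the root tag instead of hard-coding it.
ns = tree.getroot().tag.split("}")[0].strip("{")
text = tree.find(".//{%s}revision/{%s}text" % (ns, ns))
print(text.text[:300])   # raw wikitext of the page
```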

Nizam