
I have searched around but haven't gotten much help. Here's my problem: I want to start from a portal page on Wikipedia, say Computer_science, and go to its category pages. A category page lists some pages and links to subcategories. I will visit some of those pages and extract just the page abstracts, then follow the links from that category page down to the next level, and so on.

I know C++/PHP/JS/Python. Which fits best here? I'd like to do this in a day. I know there's an API, but it doesn't seem helpful for getting content.

  1. I need to get pages
  2. Parse them to reach the categories div (or the corresponding element in the raw wiki data), both to extract the abstracts and to follow links to other pages.

I need suggestions for programming languages, libraries, and public code if available. I have also heard Wikipedia doesn't like bot crawlers; I am planning to fetch maybe 500 docs at most. Is that a problem?

Thanks a lot

Sanjeev Satheesh
  • @Sanjeev Satheesh It can be done with regexes. If it isn't too complicated, it may be done relatively rapidly. I'll go to Wikipedia and your links to see what you want and study the problem – eyquem Mar 08 '11 at 11:26
  • @Sanjeev Satheesh How do we go from the portal Computer_science to its categories page? What is the link on the portal page that leads to its categories page? – eyquem Mar 08 '11 at 11:34
  • Scroll down the portal's page. There is a block called `Categories` with links. – Sanjeev Satheesh Mar 08 '11 at 11:36
  • @Sanjeev Satheesh I had seen the Category block, but I hadn't tried the link under the name "Competitions". OK now – eyquem Mar 08 '11 at 12:30
  • @Sanjeev Satheesh I wrote a little code to fetch the content of the portal's page, but there is something special about Wikipedia: I get _'Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 08 Mar 2011 12:27:13 GMT\n'_ as line 86. It reminds me that I once saw code that fetched Wikipedia pages using a special request to get the right source code, but I don't remember the request that must be sent – eyquem Mar 08 '11 at 12:36
  • If you are using a script, you should identify as some kind of browser, I think. Apparently, Wikipedia does not like random bots: you either need to be registered as a bot or identify yourself as a browser – Sanjeev Satheesh Mar 08 '11 at 12:39
  • You **must not** have your script identify as some kind of browser. See [the Wikimedia User-Agent policy](http://meta.wikimedia.org/wiki/User-Agent_policy). Read-only bots that don't query so much as to be noticed by the sysadmins are ok. – Anomie Mar 08 '11 at 16:12

4 Answers


There isn't necessarily a category corresponding to a portal, although you could try looking for a category with the same name as the portal, the categories the portal page is in (using the API, you can query this with prop=categories), or the category pages linked from the portal page (prop=links&plnamespace=14).
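For illustration, here is a minimal Python sketch of that last approach (`prop=links&plnamespace=14` against the portal page). The crawler name and contact address in the User-Agent are placeholders to fill in per the Wikimedia policy linked in the comments above:

```python
import json
import urllib.parse
import urllib.request

API = "http://en.wikipedia.org/w/api.php"
# Placeholder identity -- the User-Agent policy asks bots to name themselves
# and give contact information.
HEADERS = {"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"}

def api_get(params):
    """GET an api.php query and decode the JSON response."""
    url = API + "?" + urllib.parse.urlencode(dict(params, format="json"))
    req = urllib.request.Request(url, headers=HEADERS)
    return json.loads(urllib.request.urlopen(req).read().decode("utf-8"))

# Category pages (namespace 14) linked from the portal page.
data = api_get({
    "action": "query",
    "titles": "Portal:Computer science",
    "prop": "links",
    "plnamespace": 14,
    "pllimit": "max",
})
for page in data["query"]["pages"].values():
    for link in page.get("links", []):
        print(link["title"])
```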

Any of those languages would work. You could also pick Perl, Java, C#, Objective-C, or just about any other language. A list of frameworks of varying quality can be found here or here.

The API can certainly give you content, using prop=revisions. You can even query just the "lead" section with rvsection=0. The API can also give you the list of pages in a category with list=categorymembers and the list of categories for a page using prop=categories.
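Reusing the `api_get` helper from the sketch above, those queries look roughly like this; the combined lead-plus-categories form matches the query linked in the comments below:

```python
# Lead section and category list of one page in a single query.
data = api_get({
    "action": "query",
    "titles": "Computer programming",
    "prop": "revisions|categories",
    "rvprop": "content",
    "rvsection": 0,        # section 0 is the lead
    "cllimit": "max",
})
page = next(iter(data["query"]["pages"].values()))
print(page["revisions"][0]["*"][:300])                  # raw wikitext of the lead
print([c["title"] for c in page.get("categories", [])])

# Pages and subcategories inside a category.
data = api_get({
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:Computer science",
    "cmlimit": "max",
})
for member in data["query"]["categorymembers"]:
    print(member["ns"], member["title"])                # ns 14 entries are subcategories
```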

500 pages shouldn't be an issue. If you wanted a significant proportion of the articles, you'd want to look into using a database dump instead.

See the API documentation for details.

Anomie
  • +1 Using the API is definitely the way to go. (removed my answer) – Shawn Chin Mar 08 '11 at 12:18
  • Awesome! Thanks. I was finding it hard to pore through the API doc. So I can get the lead section alone too! Cool. Could you post 2 full API calls for me please, one for getting the leads and another for getting the categories? I keep missing one of the params and get an error. – Sanjeev Satheesh Mar 08 '11 at 12:30
  • @Sanjeev Satheesh Using the API might be the better way to get information out of Wikipedia, even for someone who knows how to get the source code of a Wikipedia page. At present I don't know either of the two ways – eyquem Mar 08 '11 at 12:42
  • It's actually possible to get both the lead section and the categories list for a page in [one query](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=revisions|categories&rvprop=content&rvsection=0&cllimit=max). But if you want separate queries, here they are for the [lead](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=revisions&rvprop=content&rvsection=0) and [cats](http://en.wikipedia.org/w/api.php?action=query&titles=Computer+programming&prop=categories&cllimit=max) – Anomie Mar 08 '11 at 16:07
  • Like [this](http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Computer+science&cmnamespace=14&cmlimit=max) then. Do be aware that the category structure is not a tree or even an acyclic graph, and that semantic drift can be a problem. Also, if you are querying large categories (i.e. more than 500 members of any type) be sure to read [the API manual section about continuing queries](http://www.mediawiki.org/wiki/API:Query#Continuing_queries). – Anomie Mar 08 '11 at 16:56

Python. Have fun with scraping the page; for this I would suggest XPath via lxml.html.
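A minimal sketch of that suggestion, assuming the `mw-normal-catlinks` div (the block of category links at the foot of an article) is the target; the User-Agent is a placeholder, since the default urllib one can trigger the ERR_ACCESS_DENIED error mentioned in the question's comments:

```python
import urllib.request
import lxml.html

req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Computer_programming",
    headers={"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"},
)
doc = lxml.html.fromstring(urllib.request.urlopen(req).read())

# Category links live in the mw-normal-catlinks div; verify against the live
# markup, since scraped selectors can go stale.
for a in doc.xpath('//div[@id="mw-normal-catlinks"]//a'):
    print(a.text_content(), a.get("href"))
```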

Jakob Bowyer

Although you are looking for a web-crawler-based solution, let me suggest that you take a look at DBpedia. Essentially, it's Wikipedia in RDF format. You can download entire database dumps, run SPARQL queries against it, or point directly at a resource and start exploring from there by walking the references.
For example, the Computer science category can be accessed at this URL:

http://dbpedia.org/page/Category:Computer_science
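As a rough illustration of the SPARQL route, this sketch asks DBpedia's public endpoint for the subcategories of that category. `skos:broader` is the property DBpedia uses to link a category to its parent, but treat the exact vocabulary as an assumption to verify against the current dataset:

```python
import json
import urllib.parse
import urllib.request

query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?sub WHERE {
  ?sub skos:broader <http://dbpedia.org/resource/Category:Computer_science> .
}
"""
# Virtuoso accepts a MIME type in the format parameter to select JSON results.
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)
data = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
for row in data["results"]["bindings"]:
    print(row["sub"]["value"])   # URIs of subcategories
```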

Zsolt Török

I suggest Python for fast development. You need two modules: one to crawl all the possible categories inside a category (basically a category tree), and another to extract info from the detail pages (i.e. normal wiki pages). Wikipedia supports Special:Export in the URL, which returns an XML response, and an XPath-capable XML library for Python will help you parse it. A sketch follows.
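Here is a minimal sketch of the Special:Export approach with the standard library's ElementTree (Python has no built-in "xpath" module; lxml provides full XPath if you need it). The page title and User-Agent are placeholders:

```python
import urllib.request
import xml.etree.ElementTree as ET

req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Special:Export/Computer_programming",
    headers={"User-Agent": "WikiAbstractCrawler/0.1 (contact: you@example.com)"},
)
tree = ET.parse(urllib.request.urlopen(req))

# The export XML is namespaced and the version suffix changes over time,
# so read the namespace off the root tag instead of hard-coding it.
ns = tree.getroot().tag.split("}")[0].strip("{")
text = tree.find(".//{%s}revision/{%s}text" % (ns, ns))
print(text.text[:300])   # raw wikitext of the page
```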

Nizam