
I'm using the Wikipedia API to get the infoboxes from certain pages; an example would be Imperial College London. My problem is the `HESA student population|INSTID=0132` value that I'm getting. I was hoping to get just the number for the student population, but instead I'm getting the ID above. How can I get the values of the infoboxes present on a page?

Moreover, if you check the wiki page, there are two infoboxes (the main one and the rankings). How can I get both of them?
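
For reference, this is roughly the kind of query I'm making (a minimal sketch, assuming Node 18+ with a global `fetch`; the exact parameters may differ):

```typescript
// Fetch the raw wikitext of the article via the action API.
const url =
  "https://en.wikipedia.org/w/api.php?action=query&prop=revisions" +
  "&rvprop=content&rvslots=main&format=json&formatversion=2" +
  "&titles=Imperial%20College%20London";

async function main(): Promise<void> {
  const data: any = await (await fetch(url)).json();
  const wikitext: string = data.query.pages[0].revisions[0].slots.main.content;

  // The infobox field comes back as the raw template call,
  // e.g. "{{HESA student population|INSTID=0132}}", not the resolved number.
  console.log(wikitext.includes("HESA student population|INSTID=0132")); // true
}

main();
```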

Alkis Kalogeris
  • See [How do you extract information from a Wikipedia infobox?](http://stackoverflow.com/questions/33862336/how-do-you-extract-information-from-a-wikipedia-infobox/33862337#33862337) – Tgr Apr 07 '16 at 14:45
  • Yes, I've read that. The wikitext is just unparseable. I've used some npm libraries, but nothing robust. Some of the values are not present (e.g. the one I'm referring to in my question). The API that returns HTML with classes is perfectly fine. There are still problems, but with some tweaking in the parsing I can overcome those. With this question I wanted to know if there is some functionality that I was missing. Nothing returns the pure infobox, but the new API is fast and has all the info I need. – Alkis Kalogeris Apr 07 '16 at 15:22
  • If you have read that then surely you have looked at [DBPedia](http://dbpedia.org/page/Imperial_College_London)? – Tgr Apr 07 '16 at 15:56
  • Yes. Unfortunately not all the values are present. – Alkis Kalogeris Apr 07 '16 at 16:01

1 Answer


There's an alternative REST API you could use to access Wikipedia content. To get well-structured HTML for an article, you would request:

https://en.wikipedia.org/api/rest_v1/page/html/Imperial_College_London

The HTML is produced by the Parsoid service, which produces HTML/RDFa content following the MediaWiki DOM spec. Infoboxes are HTML `table` elements with the class `infobox`, so you can easily locate all infoboxes on the page.

Infoboxes are normally created by complex templates, so it might be easier for you to just parse the table HTML.
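
As a rough illustration (a minimal sketch, not part of the API itself), here's one way to fetch that endpoint and pull label/value pairs out of each infobox table, assuming Node 18+ with a global `fetch` and the `cheerio` package:

```typescript
import * as cheerio from "cheerio";

async function getInfoboxes(title: string): Promise<string[][][]> {
  const res = await fetch(
    `https://en.wikipedia.org/api/rest_v1/page/html/${encodeURIComponent(title)}`
  );
  const html = await res.text();
  const $ = cheerio.load(html);

  // Each infobox is a <table class="infobox">; collect label/value pairs per row.
  return $("table.infobox")
    .toArray()
    .map((table) =>
      $(table)
        .find("tr")
        .toArray()
        .map((row) => [
          $(row).find("th").text().trim(),
          $(row).find("td").text().trim(),
        ])
        .filter(([label, value]) => label !== "" && value !== "")
    );
}

getInfoboxes("Imperial_College_London").then((boxes) => {
  console.log(boxes.length); // should include both the main and rankings boxes
  console.log(boxes[0]);     // label/value pairs of the first infobox
});
```

Since both boxes on the Imperial College London page are rendered as `table.infobox`, each should show up as one entry in the result.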

Petr
  • Hello @Petr. Thanks for the response. I can't use this API yet (although it's so much cleaner) since it's still in beta. I can do the same thing, though, with https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&rvparse=&titles=Imperial%20College%20London (sketched below), which uses the current/stable API, and all the styling is present, so I could use the same logic (parsing). I've thought about this, but I was hoping there is a better/cleaner/faster way to do it. – Alkis Kalogeris Apr 06 '16 at 20:47
  • @alkis I'm the developer of this API, so I can assure you that the 'beta' status will not be a problem for you. The `/page/html` endpoints are very stable now and are used by several major clients both inside and outside Wikimedia. VisualEditor, the Android app, the content translation tool and other features rely on this API. – Petr Apr 06 '16 at 20:58
  • That's great news. Is it faster as well? It seems to be a lot faster. – Alkis Kalogeris Apr 06 '16 at 21:01
  • It should be much faster than the PHP API because this API is cached by Varnish, so you have a good chance of getting cached content, while the PHP API is not cached at all. – Petr Apr 06 '16 at 21:06
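
A sketch of the `rvparse` alternative mentioned in the comments above, under the same assumptions (Node 18+ `fetch` and `cheerio`); since `rvparse` returns the revision as rendered HTML, the same `table.infobox` selection applies:

```typescript
import * as cheerio from "cheerio";

const url =
  "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1" +
  "&rvprop=content&rvparse=&format=json&titles=Imperial%20College%20London";

async function main(): Promise<void> {
  const data: any = await (await fetch(url)).json();

  // pages is keyed by page id; rvparse replaces the wikitext with rendered HTML.
  const page: any = Object.values(data.query.pages)[0];
  const html: string = page.revisions[0]["*"];

  const $ = cheerio.load(html);
  console.log($("table.infobox").length); // main infobox + rankings box
}

main();
```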