11

I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would be nice to get the HTML formatting of that paragraph, too.

The only way I currently see of getting something that resembles a snippet is by performing a fulltext search (example), but that's not really what I want (too short).

Is there any other way to fetch the first paragraph of a Wikipedia article than barbarically parsing HTML/WikiText?

Damjan Pavlica
Felix
  • Here's a serverless example that fetches the first N characters from a random Wikipedia article. It's not exactly what you want but may help: http://stackoverflow.com/q/15293680/589059 – rkagerer Mar 08 '13 at 12:11

4 Answers

6

Use this link to get the unparsed intro in XML form: "http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"

Earlier, I could show the introductions of a list of articles from a category on a single page by adding iframes with a src like the link above. But now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there any way around this?

ARAVIND VR
  • Your second paragraph sounds like a question rather than an answer. If you want answers to it, you should post it as a new question. Still, +1 for mentioning `prop=extracts` in your first paragraph. (I just posted a slightly more detailed description of it below.) – Ilmari Karonen Sep 09 '12 at 09:34
  • You can also add the `exintro` parameter to get only the introduction: http://fr.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exintro&titles=Margaret_Thatcher – CedricSoubrie Apr 26 '13 at 10:08
3

As ARAVIND VR notes, on wikis running the MobileFrontend extension — which includes Wikipedia — you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts API query.

For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.

The query options control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences), whether it is restricted to the intro section of the article, and how section headings are formatted in the output. It's also possible to obtain intro extracts from more than one article in a single query.
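
For illustration, here is a minimal Python sketch of such a query. The helper names (`build_extract_url`, `parse_extracts`, `fetch_extracts`) and the User-Agent string are made up for this example. The parameters `exintro`, `exsentences` and `explaintext` are the extract options mentioned above, and the parsing assumes the default JSON layout in which pages are keyed by page ID.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(titles, sentences=None, intro_only=True, plain_text=False):
    """Build a prop=extracts query URL for one or more article titles."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": "|".join(titles),  # several titles are separated by "|"
    }
    if intro_only:
        params["exintro"] = 1            # only the text before the first heading
    if sentences is not None:
        params["exsentences"] = sentences  # cap the excerpt length in sentences
    if plain_text:
        params["explaintext"] = 1        # plain text instead of limited HTML
    return API_URL + "?" + urlencode(params)


def parse_extracts(response):
    """Map each page title to its extract (assumes pages keyed by page ID)."""
    pages = response.get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}


def fetch_extracts(titles, **options):
    """Perform the request; the User-Agent value is a placeholder."""
    request = Request(
        build_extract_url(titles, **options),
        headers={"User-Agent": "extract-sketch/0.1 (example)"},
    )
    with urlopen(request) as resp:
        return parse_extracts(json.load(resp))
```

Calling `fetch_extracts(["Stack Overflow"], sentences=3)` should then return a dict mapping the title to an HTML excerpt; pass `plain_text=True` if you don't want the markup.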

Ilmari Karonen
3

I found no way of doing this through the API, so I resorted to parsing HTML, using PHP's DOM functions. This was pretty easy, something along the lines of:

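// $wikiPage is assumed to already hold the HTML of the article page
$doc = new DOMDocument();
$doc->loadHTML($wikiPage);
$xpath = new DOMXpath($doc);
// select the paragraphs that are direct children of the content div
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
// serialize the first paragraph node back to markup
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>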
Felix
  • You should use the `action=render` URL parameter; that way you need to load less stuff. Also, the excerpt is typically not the first paragraph but anything up to the first `<h2>`. – Tgr Dec 19 '10 at 15:29
2

It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0 as explained here.
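
A rough Python sketch of that request, assuming the classic `prop=revisions` query with `rvprop=content`. The function names are hypothetical, and the parser tries both the legacy response layout (text under `"*"`) and the slot-based one, since which one you get depends on the request parameters and the MediaWiki version.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_wikitext_url(title):
    """Build a query URL for the wikitext of section 0 (the intro)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,  # section 0 is everything before the first heading
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def intro_wikitext(response):
    """Return the intro wikitext from a decoded JSON response, or None."""
    for page in response.get("query", {}).get("pages", {}).values():
        revisions = page.get("revisions", [])
        if not revisions:
            continue  # missing page or no content returned
        revision = revisions[0]
        if "*" in revision:
            return revision["*"]  # legacy layout
        # slot-based layout (assumption: main slot, same "*" key)
        return revision.get("slots", {}).get("main", {}).get("*")
    return None
```

The URL can be fetched with any HTTP client and the decoded JSON passed to `intro_wikitext`; the result is still raw wikitext and needs converting.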

Converting wikitext to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:

// $c holds the intro wikitext; $html selects HTML (true) or plain-text (false) output
// remove templates (even nested)
do {
    $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments
$c = preg_replace('/<!--(?:[^-]|-[^-]|--[^>])+-->\n?/', '', $c);
// remove links
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
lapo