
Possible Duplicate:
Getting content using wikipedia API
Using PHP, how do I get the first paragraph of a Wikipedia article with the MediaWiki API?

This is mainly an XML-related question.

I'm trying to do this using the MediaWiki API.

I've managed to get a response in XML format (can change to JSON if easier), and I see all the content I need in the response. Example:

http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=War%20and%20Peace&prop=revisions&rvprop=content&format=xmlfm

I used xmlfm here for formatting reasons. In PHP I'm doing:

$request = "http://en.wikipedia.org/w/api.php?action=query&titles=War%20and%20Peace&prop=revisions&rvprop=content&format=xml"; // format given once; the original URL repeated it

$response = @file_get_contents($request);

$wxml = simplexml_load_string($response);

var_dump($wxml);

This prints out everything in the XML. My question is: how do I get the first paragraph out of this?

I can extract the first paragraph myself once I have the full article, so what I'm really asking is how to get the article text out of this XML. Of course, if there's a way to request the first paragraph directly, that would be best.

sveti petar
  • You could use the `xmlrpc_decode()` function. –  Jun 09 '12 at 12:25
  • Do you want the first paragraph or the bit before the contents box? – Cameron Martin Jun 09 '12 at 12:31
  • [Please](http://stackoverflow.com/questions/8555320/is-there-a-clean-wikipedia-api-just-for-retrieve-content-summary) [search](http://stackoverflow.com/questions/6128168/php-wikipedia-get-content-from-the-first-paragraph-in-a-wikipedia-article) [before](http://stackoverflow.com/questions/9389699/retrieve-first-paragraph-of-wikipedia-article) [asking](http://stackoverflow.com/questions/9381233/using-php-how-do-i-get-the-first-paragraph-of-a-wikipedia-article-with-the-medi), [thanks](http://stackoverflow.com/questions/2799887/how-to-scrape-the-first-paragraph-from-a-wikipedia-page). – salathe Jun 09 '12 at 12:32
  • http://stackoverflow.com/search?q=%5Bphp%5D+wikipedia+first+paragraph – salathe Jun 09 '12 at 12:33
  • Remember, `api` is your root element, so take it from `$wxml->query->..` and so on. And use the `rvsection` as mentioned in all the other answers on all those duplicate queries.. – Wrikken Jun 09 '12 at 12:37
  • @salathe The first link did the trick, thanks. For the record, I did search...a little. – sveti petar Jun 09 '12 at 12:38
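
The path described in the comment above (`api` is the root, so start from `$wxml->query`) can be sketched roughly as follows. This is only an illustration: the sample XML is hand-written and assumes the response nests as api > query > pages > page > revisions > rev, and `extractWikitext` is a hypothetical helper name, not part of any library. Adding `rvsection=0` to the request should limit the returned wikitext to the lead section.

```php
<?php
// Hand-written stand-in for the API response. ASSUMPTION: the real response
// nests its elements the same way (api > query > pages > page > revisions > rev).
$sampleXml = <<<XML
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="War and Peace">
        <revisions>
          <rev xml:space="preserve">'''War and Peace''' is a novel by Leo Tolstoy.

== Plot ==
More text.</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>
XML;

// Hypothetical helper: returns the wikitext of the first revision,
// or null if the XML cannot be parsed or has a different shape.
function extractWikitext(string $xml): ?string
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $wxml = simplexml_load_string($xml);
    if ($wxml === false) {
        return null;
    }
    // $wxml already is the <api> root element, so the path starts at ->query.
    if (!isset($wxml->query->pages->page->revisions->rev)) {
        return null;
    }
    return (string) $wxml->query->pages->page->revisions->rev;
}

$wikitext = extractWikitext($sampleXml);
```

Note that this yields raw wikitext (templates, `'''bold'''` markup and so on), not HTML, so it still needs cleaning up before display.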

1 Answer


I'd definitely say you're looking for this.

If you want to retrieve everything in the first section (not just the first paragraph):

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in json format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'https://en.wikipedia.org/w/api.php?action=parse&page=Baseball&format=json&prop=text&section=0'; // https: plain http is redirected, and cURL does not follow redirects by default
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by the wikipedia.org servers; use YOUR user agent with YOUR contact information (otherwise your IP might get blocked)
$c = curl_exec($ch);
curl_close($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // main text content of the response (parsed HTML)

// pattern matching each paragraph in the section
$pattern = '#<p>(.*?)</p>#s'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if (preg_match_all($pattern, $content, $matches))
{
    // print $matches[0][0]; // first paragraph only, including the wrapping <p> tag
    print strip_tags(implode("\n\n", $matches[1])); // all paragraphs of the section, without the HTML tags
}
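
If only the first paragraph is wanted rather than the whole section, something along these lines might work. It is a minimal sketch, not a robust HTML parser: `firstParagraph` is a hypothetical helper, the sample HTML is hand-written, and it assumes the parsed output may contain `<p>` tags with attributes or empty paragraphs ahead of the real text.

```php
<?php
// Hypothetical helper: returns the plain text of the first non-empty
// <p> element in an HTML fragment, or null if there is none.
function firstParagraph(string $html): ?string
{
    // \b keeps <pre> from matching; [^>]* allows attributes on the <p> tag.
    if (!preg_match_all('#<p\b[^>]*>(.*?)</p>#s', $html, $matches)) {
        return null;
    }
    foreach ($matches[1] as $inner) {
        // Drop inline tags, decode entities, and trim surrounding whitespace.
        $text = trim(html_entity_decode(strip_tags($inner), ENT_QUOTES | ENT_HTML5, 'UTF-8'));
        if ($text !== '') {
            return $text; // skip paragraphs that are empty once tags are stripped
        }
    }
    return null;
}

// Hand-written sample standing in for $json->{'parse'}->{'text'}->{'*'}.
$sample = '<div><p class="empty"></p><p><b>Baseball</b> is a bat-and-ball sport.</p><p>Second paragraph.</p></div>';
print firstParagraph($sample); // prints: Baseball is a bat-and-ball sport.
```

With the answer's code, the call would be `firstParagraph($content)`.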
Cameron Martin