I am trying to retrieve the XML of a Wikipedia page using their API. The URL I'm using is the following: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=dog

I've seen this, but it hasn't helped. No matter what I do, I'm not actually getting anything returned to $c, and I can't figure out why. I can do file_get_contents with a plain text file, and it works just fine. Can anyone else verify that this works?

<?php
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$c = file_get_contents($url);
echo $c;
?>

EDIT: I have also tried the cURL snippet from that page, which doesn't work either:

$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
echo $c;
cryptic_star
  • Perhaps URLs are disabled for file_get_contents by your hosting company. Have you tried cURL instead? – Twelve47 Apr 12 '11 at 15:30
  • `Warning: file_get_contents(http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in [file]` – Karl Andrew Apr 12 '11 at 15:33
  • I've tried the curl too, which I also couldn't get to work. I've posted it above for reference. – cryptic_star Apr 12 '11 at 15:36

1 Answer

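Wikipedia requires that you send a descriptive user agent with API requests. A request without one is likely to be rejected, which would explain the 403 Forbidden reported in the comments. With cURL you can set it like this: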

<?php
$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';
$ch = curl_init($url);
curl_setopt ($ch, CURLOPT_USERAGENT, "MyCoolTool (+http://example.com/MyCoolToolPage/)");
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$c = curl_exec($ch);
echo $c;
?>

Use a user-agent string that describes your tool or site. Don't spoof a web browser's user agent, or you may be blocked for appearing suspicious (source: Wikimedia User-Agent policy).
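
Since the question started out with file_get_contents, the same header can also be sent without cURL by passing a stream context. This is only a minimal sketch: it assumes allow_url_fopen is enabled on your host, the helper name make_wikipedia_context is hypothetical, the 10-second timeout is an arbitrary choice, and the user-agent string is a placeholder you should replace with your own tool name and contact page.

```php
<?php
// Hypothetical helper (not part of PHP or the Wikipedia API): builds a
// stream context that sends a descriptive User-Agent with HTTP requests.
function make_wikipedia_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent, // sent as the User-Agent request header
            'timeout'    => 10,         // seconds; arbitrary value for this sketch
        ],
    ]);
}

$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&redirects&titles=Main%20Page';

// Placeholder identity: use something that names your program and
// includes a website or email address.
$context = make_wikipedia_context('MyCoolTool (+http://example.com/MyCoolToolPage/)');

// file_get_contents() returns false on failure, so check before echoing.
$c = file_get_contents($url, false, $context);
if ($c === false) {
    echo "Request failed\n";
} else {
    echo $c;
}
```

If this still returns nothing, check the PHP warning output as in the comment above; the HTTP status line in the warning usually shows whether the request was refused or never sent.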

Twelve47
  • Do not use a browser user agent, or you are liable to get your IP address banned by the sysadmins. Use something that identifies your program and contains your email or website address. See [Wikimedia's User-Agent policy](http://meta.wikimedia.org/wiki/User-Agent_policy) for details. – Anomie Apr 12 '11 at 16:40
  • @Anomie, thanks. I've updated my answer to take that into account. – Twelve47 Apr 12 '11 at 21:06