
I have a task: fetch the Wikipedia article for an entered keyword, save it to a database, and then search within the saved articles.

The problem is how to access the API and retrieve data from Wikipedia. I've tried this URL (at the beginning I tried the JSON format):

$url = 'https://en.wikipedia.org/w/api.php?action=query&titles=Dog&prop=revisions&rvprop=content&format=xml';

and this php code:

$ch=curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); 
$res = curl_exec($ch);
if (!$res) {
    echo 'cURL Error: '.curl_error($ch);
}
var_dump($res);

but nothing happened. Is it possible to access the data with curl?

In the end, one piece of code did work with the URL above:

ini_set('user_agent','TestText');
$xmlDoc = new \DOMDocument();
$xmlDoc->load($url);
echo($xmlDoc->saveXML());

and then I get text like this:

{{about|the domestic dog|related species known as "dogs"|Canidae|other uses|Dog (disambiguation)|}} {{Redirect|Doggie|the Danish artist|Doggie (artist)}} {{pp-semi-indef}} {{pp-move-indef}} {{Taxobox | name = Domestic dog | fossil_range = {{Fossil range|0.033|0}}[[Pleistocene]] – [[Recent]] |

How can I make it prettier (text with paragraphs, or at least plain text)?

So there are two questions: 1. Is it possible to access Wikipedia data with PHP curl, and how should I improve my code? 2. How do I make the wiki XML output prettier?

My question is about the code, especially about curl: why doesn't it work? Also, the answer to the other question only covers Wikipedia API URLs. I can't solve the problem just by changing the URL.

I've found the solution: CURLOPT_SSL_VERIFYPEER needed to be set to false:

$url = 'http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&explaintext=&titles=Dog';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']); 
$res = curl_exec($ch);
//$json_data = mb_substr($res, curl_getinfo($ch, CURLINFO_HEADER_SIZE));
curl_close($ch);
$json = json_decode($res);

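// The pages object is keyed by page ID, so loop over it to find the key.
$content = $json->query->pages;
$wiki_id = '';
foreach ($content as $key => $value) {
    $wiki_id = $key;
}
// Print the plain-text extract of that page.
echo $content = $content->$wiki_id->extract;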
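
The parsing step above can also be written without the loop. The following is only a sketch, not the code used above: `firstPageExtract` is a hypothetical helper name, the sample JSON is made up to mirror the shape the code above reads (`query` → `pages` → page ID → `extract`), and the nullable return type assumes PHP 7.1 or later. It returns `null` instead of raising a notice when the response is not valid JSON or the page has no `extract` field.

```php
<?php
// Hypothetical helper: return the plain-text extract of the first page in an
// action=query&prop=extracts JSON response, or null if none can be found.
function firstPageExtract(string $body): ?string
{
    // Decode into nested arrays; invalid JSON yields null.
    $data = json_decode($body, true);
    if (!is_array($data)
        || !isset($data['query']['pages'])
        || !is_array($data['query']['pages'])) {
        return null;
    }

    // The pages map is keyed by page ID; reset() returns its first element
    // (or false when the map is empty).
    $pages = $data['query']['pages'];
    $page = reset($pages);
    if (!is_array($page) || !isset($page['extract'])) {
        return null;
    }

    return $page['extract'];
}

// Made-up sample response (the page ID and text are placeholders).
$sample = '{"query":{"pages":{"123":{"pageid":123,"title":"Dog","extract":"Sample extract text."}}}}';
echo firstPageExtract($sample), "\n";
```

With the cURL code above, the call would be `firstPageExtract($res)` in place of the `json_decode` and `foreach` lines.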
GingerN
  • To get the rendered HTML of a wiki page, you can just append `action=render` to the URL, like this: https://sv.wikipedia.org/wiki/Portal:Huvudsida?action=render – leo Oct 17 '15 at 20:09
  • Possible duplicate of [Get Text Content from mediawiki page via API](http://stackoverflow.com/questions/1625162/get-text-content-from-mediawiki-page-via-api) – leo Oct 17 '15 at 20:10
  • Also, flagging as a duplicate, as there are already dozens of questions with the same answers – leo Oct 17 '15 at 20:11
  • But it has no answer, especially about curl. – GingerN Oct 18 '15 at 08:26
  • But curl seems to be working fine for you already? You seem to be getting the wiki text back just as you should, so you just have to replace the URL! – leo Oct 18 '15 at 08:28
  • I wrote that nothing happened with curl. Only \DOMDocument did something. – GingerN Oct 18 '15 at 08:32
  • Ok, what error message does your web server give you then? – leo Oct 18 '15 at 08:35
  • (your code works for me. I suspect you might not have the PHP/curl binding installed, but it's impossible to say without knowing what error messages you get) – leo Oct 18 '15 at 08:43
  • I don't get any error messages, nothing happened. And I've checked: curl is set. – GingerN Oct 18 '15 at 10:33
  • Do any errors at all get written to your log file? What [error level](http://php.net/manual/en/errorfunc.configuration.php) do you use? Note that error messages are not written to the screen, you have to look for them in your error.log file or similar. – leo Oct 18 '15 at 16:49
  • Also, does curl work for other URLs? If not, can you run curl off the command line? – leo Oct 18 '15 at 16:50
  • I'm just confused. There are no errors in the log file either. Below is what I get with the google.com URL. Should it be like that? "302 Moved. The document has moved here." – GingerN Oct 19 '15 at 14:35
  • Then I'm out of ideas. Your code works just fine for me, and it appears to be working ok for you with another url. How about fetching the Wikipedia url with curl from the command line, what happens then? – leo Oct 19 '15 at 17:42
  • Nothing. I've also tried variations of plain wikipedia.org and nothing happened again. I'm new to curl, and with this strange behaviour I began to doubt whether I'm doing it right or wrong. I think the only thing left is to try curl with other sites... – GingerN Oct 19 '15 at 18:27
  • The solution, as often in these strange cases, was simple: CURLOPT_SSL_VERIFYPEER was needed. – GingerN Oct 20 '15 at 12:29
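
Setting CURLOPT_SSL_VERIFYPEER to false makes cURL skip certificate verification. If the underlying failure was a certificate check (an assumption, since the thread never shows the cURL error text), a hedged alternative is to keep verification on, optionally point cURL at a CA bundle with CURLOPT_CAINFO, and print the error number and message when the request fails. In this sketch `wikiCurlOptions` and `fetchUrl` are hypothetical helper names, the bundle path in the usage comment is a placeholder, and the fixed user agent string is the one from the `ini_set` example in the question (`$_SERVER['HTTP_USER_AGENT']` is not set when PHP runs from the command line).

```php
<?php
// Hypothetical helper: build cURL options for an API request while leaving
// certificate verification switched on.
function wikiCurlOptions(string $url, ?string $caBundle = null): array
{
    $opts = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_SSL_VERIFYPEER => true,       // keep verification enabled
        CURLOPT_USERAGENT      => 'TestText', // fixed string, works under CLI too
    ];
    if ($caBundle !== null) {
        // Path to a CA certificate bundle (placeholder; depends on the system).
        $opts[CURLOPT_CAINFO] = $caBundle;
    }
    return $opts;
}

// Hypothetical helper: run the request and fail loudly with the cURL error.
function fetchUrl(array $opts): string
{
    $ch = curl_init();
    curl_setopt_array($ch, $opts);
    $res = curl_exec($ch);
    if ($res === false) {
        $msg = curl_errno($ch) . ': ' . curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL error ' . $msg);
    }
    curl_close($ch);
    return $res;
}

// Usage (commented out so the sketch does not make a network request):
// $url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&explaintext=&titles=Dog';
// $res = fetchUrl(wikiCurlOptions($url, '/path/to/cacert.pem'));
```

Comparing against `false` with `===` separates a failed request from one that succeeded with an empty body, which `if (!$res)` cannot do.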
