PHP crawler not working for Wikipedia

Question

Below is my PHP code to output the text under id=Summary. The script works fine for other websites but not for Wikipedia. I have also pasted the error I am getting below. Is Wikipedia restricting the parser script? If so, is there any way to parse and get the content from Wikipedia? Thanks in advance.

<?php


function getElementByIdAsString($url, $id, $pretty = true) {
    $doc = new DOMDocument();

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);


    // var_dump($doc->loadHTMLFile($url)); die;
    error_reporting(E_ERROR | E_PARSE);
    if(!$result) {
        throw new Exception("Failed to load $url");
    }
    $doc->loadHTML($result);
    // Obtain the element
    $element = $doc->getElementById($id);

    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    if($pretty) {
        $doc->formatOutput = true;
    }

    // Return the string representation of the element
    return $doc->saveXML($element);
}

// Here I am displaying the output in bold text
echo getElementByIdAsString('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'Summary');
?>

Error:

Fatal error: Uncaught exception 'Exception' with message 'Failed to load http://en.wikipedia.org/wiki/A_Brief_History_of_Time' in C:\xampp\htdocs\example2.php:25 Stack trace: #0 C:\xampp\htdocs\example2.php(49): getElementByIdAsString() #1 {main} thrown in C:\xampp\htdocs\example2.php on line 25

The curl_error() function (http://php.net/manual/en/function.curl-error.php) would return the error from cURL — Mathieu de Lorimier, Mar 24 '16 at 17:32
SSL certificate problem, verify that the CA cert is OK. Details: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed — Iqbal Honnur, Mar 24 '16 at 17:38
Possible duplicate of [HTTPS and SSL3\_GET\_SERVER\_CERTIFICATE:certificate verify failed, CA is OK](http://stackoverflow.com/questions/6400300/https-and-ssl3-get-server-certificatecertificate-verify-failed-ca-is-ok) — Mathieu de Lorimier, Mar 24 '16 at 18:36
I have tried that, but it didn't work. Can you correct my code to make it functional, if possible? Thanks — Iqbal Honnur, Mar 24 '16 at 18:38
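
The curl_error() suggestion from the comments can be sketched roughly as follows. This is only an illustration, not the asker's code: the helper name fetchOrExplain is hypothetical, and it assumes the cURL extension is loaded. The idea is to put cURL's own error number and message into the exception, so a failure such as the SSL certificate problem quoted above shows up directly instead of a bare "Failed to load".

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL and, on
// failure, include cURL's own error number and message in the exception.
function fetchOrExplain(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);

    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns false on failure.
    if ($result === false) {
        $errno = curl_errno($ch);   // numeric cURL error code
        $error = curl_error($ch);   // human-readable description
        throw new RuntimeException("Failed to load $url: cURL error $errno: $error");
    }

    return $result;
}
```

Wrapping the call in try/catch and printing $e->getMessage() should then show the underlying cause of the failed request.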

score 1 · Answer 1 · edited May 23 '17 at 12:31

This looks like a duplicate of: php crawler for wiki getting error

The reason is that cURL tries to verify the server's SSL certificate and fails, so adding:

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

removes the problem, but I suggest using all of these:

curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
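
Note that turning off CURLOPT_SSL_VERIFYPEER and CURLOPT_SSL_VERIFYHOST means the connection is no longer authenticated, so it is best kept to local testing. A possible alternative, sketched below under assumptions, is to leave verification on and point cURL at a CA bundle. The function name fetchWithCaBundle and the bundle path in the usage note are hypothetical; a bundle file such as the cacert.pem published on the curl website should work here.

```php
<?php
// Sketch only: keep certificate verification enabled and tell cURL where
// the CA bundle lives. Function name and bundle path are assumptions.
function fetchWithCaBundle(string $url, string $caBundlePath): string
{
    // Fail early with a clear message if the bundle file is missing.
    if (!is_readable($caBundlePath)) {
        throw new InvalidArgumentException("CA bundle not readable: $caBundlePath");
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);   // verify the certificate chain
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);      // verify the host name matches
    curl_setopt($ch, CURLOPT_CAINFO, $caBundlePath);  // trusted CA certificates

    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException("Failed to load $url: " . curl_error($ch));
    }

    return $result;
}
```

It could be called as fetchWithCaBundle('https://en.wikipedia.org/wiki/A_Brief_History_of_Time', 'C:\path\to\cacert.pem'), where the path is a placeholder for wherever the bundle is saved. Setting curl.cainfo in php.ini to the same file is another way to get the same effect without changing the code.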

answered Mar 25 '16 at 05:02

Paweł Liszka
