
I want to extract URL data the way Facebook does. For that I am using PHP's DOMDocument. When I retrieve DOM content, for example the "title" element, DOMDocument returns 0 elements. Here is my code:

    <?php
    header("Content-Type: text/xml");
    echo '<?xml version="1.0" encoding="UTF-8" ?>';    
    //$url = $_REQUEST["url"];
    $url = "http://business.tutsplus.com/articles/how-to-set-up-your-first-magento-store--fsw-43137";
    $ch = curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_HEADER,0);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
    curl_setopt($ch,CURLOPT_SSL_VERIFYPEER,false);
    $data = curl_exec($ch);
    curl_close($ch);
    $dom = new DOMDocument();
    @$dom->loadHTML($data);
    $title = $dom->getElementsByTagName("title");
    //$title = $dom->find("title");
    echo "<urlData>";
        echo "<title>";
            echo $title->length;
        echo "</title>";
    echo "</urlData>";
    ?>

Here `$title->length` returns 0. What is the problem?

Konrad Krakowiak
Siva Charan
  • You're suppressing errors with @ in at least one place there. If you don't, do you get any warnings that could conceivably be relevant? – IMSoP Apr 28 '15 at 20:57
  • I removed the comment. It gives the following output without raising any errors or warnings. – Siva Charan Apr 28 '15 at 21:01
  • This XML file does not appear to have any style information associated with it. The document tree is shown below. 0 – Siva Charan Apr 28 '15 at 21:01
  • I see that $data mentions something about typing a captcha. I assume the page is looking at things like the user agent. Consider spoofing the `agent` along with the request. I assume there is no API, hence your scraping... – ficuscr Apr 28 '15 at 21:04
  • I didn't get you, @ficuscr – Siva Charan Apr 28 '15 at 21:07
  • Look here: http://codepad.viper-7.com/CV7cIV Note that the page returned is not the page shown when navigating to that URI in a web browser. It is checking that you are human and not a robot, hence the captcha. Try dumping `$data` and looking at what you are trying to parse. – ficuscr Apr 28 '15 at 21:10
  • I tried with $url="google.com", and the output gave the following warning: – Siva Charan Apr 28 '15 at 21:26
  • Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 38 in C:\wamp\www\charan\parseURL\extracturl.php on line 16 where line 16 is $dom->loadHTML($data); – Siva Charan Apr 28 '15 at 21:27
  • Getting off topic... http://stackoverflow.com/questions/1685277/warning-domdocumentloadhtml-htmlparseentityref-expecting-in-entity describes your latest issue. Suggest closing this question. – ficuscr Apr 28 '15 at 21:54
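
The suggestions in the comments above (send a browser-like user agent, inspect what was actually downloaded, and collect libxml parse warnings instead of hiding them with @) could be combined roughly as follows. This is only a sketch under assumptions: `fetchHtml` and `extractTitle` are hypothetical helper names, the User-Agent string is an arbitrary example value, and nothing here guarantees that the target site will stop serving its captcha page.

```php
<?php
// Hypothetical helper (not from the original post): download a page while
// sending a browser-like User-Agent header, as suggested in the comments.
function fetchHtml(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Assumed value; any browser-like string could be used here.
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)');
    $data = curl_exec($ch);

    // curl_exec() returns false on failure; report that as null.
    return is_string($data) ? $data : null;
}

// Hypothetical helper: return the text of the first <title> element,
// or null when the markup has none (for example, a captcha page).
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect parse warnings (such as htmlParseEntityRef) internally
    // instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }

    return trim($titles->item(0)->textContent);
}
```

A possible way to use it: call `fetchHtml($url)`, dump the result with `var_dump()` to confirm it is the expected page rather than a captcha form, and only then pass it to `extractTitle()`. Note that the original script echoes `$title->length` (the number of matching elements), whereas this sketch returns the title text itself.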

0 Answers