I'm trying to parse HTML fetched with cURL using DOMDocument or XPath, but with CURLOPT_RETURNTRANSFER set the URL's HTML always comes back as a string, which seems to make it invalid HTML for parsing.

Returned output:

string(102736) "<!DOCTYPE html>


    <html itemscope itemtype="http://schema.org/QAPage" class="html__responsive">

    <head>

        <title>html - PHP outputting text WITHOUT echo/print? - Stack Overflow</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d">
        <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">
        <meta name="viewport" content="width=device-width, height=device-height, initial-scale=1.0, minimum-scale=1.0">"

PHP snippet that produces the output:

$cc = $http->get($url);
var_dump($cc);

CURL library used: https://github.com/seikan/HTTP/blob/master/class.HTTP.php

When I remove CURLOPT_RETURNTRANSFER, I see the HTML without the string(102736) prefix, but cURL then echoes the page even though I didn't ask it to (reference: curl_exec printing results when I don't want to).

Here is the PHP snippet I used to parse the HTML:

  $cc = $http->get($url);
  $doc = new \DOMDocument();
  $doc->loadHTML($cc);

  // all links in document
  $links = [];
  $arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
  foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
      'href' => $href,
      'text' => $text
    ];
  }

Any idea?

1 Answer

Check the return value -

print_r($cc);

you will probably find that the output is an array (if the code ran successfully). From the library source, the return of get() is...

return [
    'header' => $headers,
    'body'   => substr($response, $size),
];

So you will need to change the load line to be...

$doc->loadHTML($cc['body']);
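
To make the point concrete, here is a minimal sketch that runs without a network call. The `extract_body()` helper is hypothetical (it is not part of the seikan/HTTP library), and the `$cc` array is only a stand-in for whatever `$http->get($url)` returns, assuming the `header`/`body` shape quoted above. It also shows that a plain PHP string is exactly what `loadHTML()` expects; the `string(102736) "..."` wrapper is something `var_dump()` prints, not part of the value.

```php
<?php
// Hypothetical helper (not part of the seikan/HTTP library): accepts either
// the array shape quoted above or a plain string and returns the HTML text.
function extract_body($response): string
{
    if (is_array($response) && isset($response['body'])) {
        return (string) $response['body'];
    }
    if (is_string($response)) {
        return $response;
    }
    throw new InvalidArgumentException('Unexpected response type: ' . gettype($response));
}

// Stand-in for what $http->get($url) returns, so the sketch runs offline.
// The header contents here are made up.
$cc = [
    'header' => ['Content-Type' => 'text/html'],
    'body'   => '<!DOCTYPE html><html><body><a href="https://stackoverflow.com">Stack Overflow</a></body></html>',
];

$html = extract_body($cc);

// var_dump($html) would print string(N) "..." around the markup; that wrapper
// is var_dump's own annotation, not part of the value passed to loadHTML().
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$doc->loadHTML($html);
libxml_clear_errors();

$firstLink = $doc->getElementsByTagName('a')->item(0);
echo $firstLink->getAttribute('href'), PHP_EOL;
```

With the real library you would pass the result of `$http->get($url)` to the helper instead of the hard-coded array.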

Update:

as an example of the above and using this question as the page to work with...

$cc = $http->get("https://stackoverflow.com/questions/51319473/curlopt-returntransfer-returns-html-in-string/51319585?noredirect=1#comment89619183_51319585");
$doc = new \DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($cc['body']);

// all links in document
$links = [];
$arr = $doc->getElementsByTagName("a"); // DOMNodeList Object
foreach($arr as $item) { // DOMElement Object
    $href =  $item->getAttribute("href");
    $text = trim(preg_replace("/[\r\n]+/", " ", $item->nodeValue));
    $links[] = [
        'href' => $href,
        'text' => $text
    ];
}

print_r($links);

Outputs...

Array
(
    [0] => Array
        (
            [href] => #
            [text] => 
        )

    [1] => Array
        (
            [href] => https://stackoverflow.com
            [text] => Stack Overflow
        )

    [2] => Array
        (
            [href] => #
            [text] => 
        )

    [3] => Array
        (
            [href] => https://stackexchange.com/users/?tab=inbox
...
Nigel Ren
  • I followed your solution `$doc->loadHTML($cc['body']);` but it still comes back as a string. Whether I check it with var_dump or with `if (is_string($cc)) {echo "yes";}`, everything indicates that it's a string, not plain HTML. –  Jul 13 '18 at 09:27
  • I've updated the code example with a test run using this page as the URL, with sample output. – Nigel Ren Jul 13 '18 at 14:12
  • Thank you so much, your answer helped. Can DOMDocument get elements based on a CSS selector like jQuery does? I want to target hrefs based on a specific class. –  Jul 14 '18 at 08:45
  • You would have to use XPath, https://stackoverflow.com/questions/8680721/php-dom-xpath-search-for-class-name may help. – Nigel Ren Jul 14 '18 at 08:51
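
As a rough sketch of the XPath approach suggested in the last comment: the markup and the class name `question-hyperlink` below are made up for illustration, and in practice the HTML would come from `$cc['body']`. The query pads the `class` attribute with spaces so that only whole class names match, which is a common way to approximate a CSS `.class` selector in XPath 1.0.

```php
<?php
// Sample markup standing in for $cc['body']; the class names are made up.
$html = <<<HTML
<!DOCTYPE html>
<html><body>
  <a class="nav-link" href="/home">Home</a>
  <a class="question-hyperlink featured" href="/questions/1">First question</a>
  <a class="question-hyperlinks" href="/other">Near miss</a>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$class = 'question-hyperlink';

// Pad the class attribute with spaces so only whole class tokens match,
// e.g. "question-hyperlinks" is not treated as "question-hyperlink".
$query = sprintf(
    '//a[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
    $class
);

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```

Only the second link is collected here, since the first has a different class and the third merely contains the class name as a substring.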