I am following a tutorial that shows how to write a program that parses a web page and finds all the links. However, this program works only on pages that use http. Whenever I try to run it against a site that has a certificate (https) it throws the following error:

Fatal error: Uncaught ValueError: DOMDocument::loadHTML(): Argument #1 ($source) must not be empty in C:\xampp\htdocs\froogal\classes\DomDocumentParser.php:14
Stack trace:
#0 C:\xampp\htdocs\froogal\classes\DomDocumentParser.php(14): DOMDocument->loadHTML('')
#1 C:\xampp\htdocs\froogal\crawl.php(6): DomDocumentParser->__construct('http://www.appl...')
#2 C:\xampp\htdocs\froogal\crawl.php(18): followLinks('http://www.appl...')
#3 {main}
  thrown in C:\xampp\htdocs\froogal\classes\DomDocumentParser.php on line 14

The code for the DomDocumentParser.php file is:

<?php
class DomDocumentParser {

    private $doc;

    public function __construct($url) {

        $options = array(
            'http'=>array('method'=>"GET", 'header'=>"User-Agent: doodleBot/0.1\n")
            );
        $context = stream_context_create($options);

        $this->doc = new DomDocument();
        @$this->doc->loadHTML(file_get_contents($url, false, $context));
    }

    public function getlinks() {
        return $this->doc->getElementsByTagName("a");
    }

}
?>

And the code for crawl.php is:

<?php
include("classes/DomDocumentParser.php");

function followLinks($url) {

    $parser = new DomDocumentParser($url);

    $linkList = $parser->getLinks();

    foreach($linkList as $link) {
        $href = $link->getAttribute("href");
        echo $href . "<br>";
    }

}

$startUrl = "http://www.apple.com";
followLinks($startUrl);
?>
  • Try changing http to https. – Shibon Jul 29 '21 at 05:59
  • What does the `file_get_contents()` return? Have you made sure you can [see any and all error levels](https://stackoverflow.com/questions/845021/how-can-i-get-useful-error-messages-in-php)? – Phil Jul 29 '21 at 06:05

2 Answers


I got the same error. I found that the data returned by file_get_contents() can cause a UTF-8 encoding problem when it is passed to loadHTML(). You can work around this by prepending an XML encoding declaration to the data, so that the parser treats it as UTF-8. It worked for me, so you can try it too. All you need to do is change this line:

@$this->doc->loadHTML('<?xml encoding="UTF-8">'.file_get_contents($url,false,$context));
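
A minimal sketch of the same trick as a standalone function, assuming the HTML has already been fetched as a non-empty UTF-8 string. The name `loadUtf8Html` is hypothetical and not part of the tutorial. `libxml_use_internal_errors()` is used here in place of the `@` operator, so markup warnings are collected rather than silently discarded.

```php
<?php
// Hypothetical helper (not from the tutorial): parse an HTML string
// that is assumed to be UTF-8 encoded.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // The prepended declaration is a hint to the parser that the bytes are UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Example input with non-ASCII characters; prints the href of each link.
$doc = loadUtf8Html('<p><a href="/café">Café</a></p>');
foreach ($doc->getElementsByTagName('a') as $link) {
    echo $link->getAttribute('href'), "\n";
}
```

Because the declaration makes the string non-empty, loadHTML() no longer throws when the download failed. It just produces a document with no links, so it is still worth checking what file_get_contents() returned.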
Bayramow

The actual issue here is that empty content is being passed to the loadHTML() method. To fix this error, first check whether the content is empty. If it is, the problem lies in the earlier lines that fetch it, that is, the file_get_contents() call. If the content is not empty, then you can look into the other issues mentioned in the other answers.

For example:

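// $content holds whatever file_get_contents() returned for the page
if (!empty($content)) {
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    // Continue with your operations here
} else {
    // The fetch failed or returned nothing, so there is nothing to parse
    echo "Content is empty.";
}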
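
To make it easier to "focus on the previous lines", the fetch can be separated from the parsing so that a failed download is reported instead of reaching loadHTML(). The sketch below is only an illustration: `fetchHtml` is a hypothetical name, and the openssl check reflects a common cause of https failures on XAMPP installs, which is an assumption about this setup rather than something the question confirms.

```php
<?php
// Hypothetical helper (not from the tutorial): fetch a page and return null,
// rather than false or an empty string, when nothing usable comes back.
function fetchHtml(string $url, $context = null): ?string
{
    // Assumption: https requests need the openssl extension, which registers
    // the "https" stream wrapper. Without it, file_get_contents() fails.
    $scheme = parse_url($url, PHP_URL_SCHEME);
    if ($scheme === 'https' && !in_array('https', stream_get_wrappers(), true)) {
        error_log('The https wrapper is unavailable; check extension=openssl in php.ini');
        return null;
    }

    // Clear any stale error so error_get_last() reflects this call only.
    error_clear_last();
    $content = @file_get_contents($url, false, $context);

    if ($content === false || trim($content) === '') {
        $error = error_get_last();
        error_log("Could not fetch $url: " . ($error['message'] ?? 'empty response'));
        return null;
    }

    return $content;
}
```

In the constructor from the question, this would be called as `$html = fetchHtml($url, $context);`, with loadHTML() invoked only when `$html` is not null. The logged message then shows why the download failed, which the `@` operator otherwise hides.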
wplover