I want to fetch the HTML content of this page as a string using file_get_contents:

https://www.emitennews.com/search/

Then I want to unminify (pretty-print) the HTML.

This is what I have tried so far:

$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);

But the code above produces this error:

DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1

What is the proper way to do this?

Dennis Liu

2 Answers

Prepend an XML encoding declaration to the HTML before loading it:

$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
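
As a minimal sketch of what the prefix is meant to do, the snippet below loads a small UTF-8 fragment with the declaration prepended. The fragment is a hypothetical stand-in for the fetched page, and how the hint is interpreted depends on the libxml2 version PHP is built against, so treat this as an illustration rather than a guarantee. Note that the prefix only concerns character encoding; it does not by itself affect formatting or parser warnings.

```php
<?php
// Hypothetical UTF-8 input standing in for the fetched page.
$html = '<p>café</p>';

// Keep parser diagnostics out of the output for this demo.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// The prepended declaration is a commonly used hint that the bytes are UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Read the paragraph text back out of the parsed tree.
$paragraphs = $dom->getElementsByTagName('p');
$text = $paragraphs->item(0)->textContent;
echo $text, PHP_EOL;
```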
Mark Smith
  • I added the XML tag, but it is not working. When I view the page source, the HTML is still minified. – Dennis Liu Sep 04 '22 at 15:53
  • You can test first with a simple $html value such as "abc". I guess the HTML you got from the website has an invalid XML structure. – Mark Smith Sep 04 '22 at 15:59
  • Thank you for the response. The problem is the HTML5 that the website uses. I need to call "libxml_use_internal_errors(true);" before loading the HTML. – Dennis Liu Sep 05 '22 at 04:25

This is the code that works:

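$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
// The XML declaration prefix hints that the input is UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED);
// Discard the buffered parse errors once loading is done.
libxml_clear_errors();
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);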

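The problem is that the site uses HTML5 elements such as <header>, which libxml's HTML parser does not recognize and reports as warnings. So we need to call this before loading the HTML: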

libxml_use_internal_errors(true);
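
To see what that call changes, here is a minimal sketch that parses a small HTML5 snippet, reads back whatever the parser buffered, and then restores the previous error-handling setting. The inline snippet and the `$messages` array are hypothetical and only there for illustration; whether any errors are reported at all depends on the libxml2 version in use.

```php
<?php
// Hypothetical HTML5 snippet standing in for the fetched page.
$html = '<!DOCTYPE html><html><body><header><nav>Menu</nav></header>'
      . '<main><p>Hi</p></main></body></html>';

// Buffer parser errors instead of emitting PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$loaded = $dom->loadHTML($html);

// Collect the buffered diagnostics, then clear the buffer and restore the setting.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Serialize the parsed tree, as in the answer above.
$pretty = $dom->saveXML($dom->documentElement);
echo $pretty, PHP_EOL;
```

Inspecting `$messages` instead of discarding the errors outright makes it easier to tell harmless "unknown tag" reports apart from genuinely malformed markup.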

Dennis Liu