I'm trying to collect some information from a web service, but I'm having trouble with the CDATA sections of a page. Everything works when I use something like this:
$url = 'http://www.example.com';
$content = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($content);
foreach ($doc->getElementsByTagName('h3') as $subtitle) {
    echo $subtitle->textContent; // The output is the subtitle(s).
}
But when the page contains CDATA sections, the line $doc->loadHTML($content) raises this warning:
Warning: DOMDocument::loadHTML(): Invalid char in CDATA
I found a suggested solution here and tried to implement it, without success:
function sanitize_html($content) {
    if (!$content) return '';
    $invalid_characters = '/[^\x9\xa\x20-\xD7FF\xE000-\xFFFD]/';
    return preg_replace($invalid_characters, '', $content);
}
$url = 'http://www.example.com';
$content = file_get_contents($url);
$cleanContent = sanitize_html($content);
$doc = new DOMDocument();
$doc->loadHTML($cleanContent); //Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
With that change the CDATA warning goes away, but I get a different one instead:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity
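One direction I'm considering, in case it helps frame the question: as far as I understand, libxml can collect parse errors internally instead of emitting PHP warnings, via libxml_use_internal_errors(). Below is a minimal sketch of that idea, not a confirmed fix. The $sample markup is a made-up stand-in for the real page (I'm assuming a bare & is the kind of input behind the htmlParseEntityRef message), and I have not verified that this covers every CDATA case.

```php
<?php
// Hypothetical sample standing in for the downloaded page; the bare "&"
// is assumed to be the kind of input behind "htmlParseEntityRef: no name".
$sample = '<html><body><h3>Tom & Jerry</h3></body></html>';

// Ask libxml to store parse errors instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample);

// Keep the parser's complaints for inspection, then empty the error buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

// Collect the subtitles exactly as in the first snippet.
$subtitles = [];
foreach ($doc->getElementsByTagName('h3') as $subtitle) {
    $subtitles[] = $subtitle->textContent;
}
```

This only hides and records the warnings while the parser recovers on its own; it does not repair the markup, so I'm not sure whether it is the right approach or just a workaround.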
What would be a good way to deal with the CDATA sections of a page? Greetings.