Let's assume my $html looks like this:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type="text/javascript" src="/gui/default/tinymcecontent.js"></script>
    <script type="text/javascript" src="/includes/js/video-js/video.min.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p> </p>
    <table width="100%">        
    </table>
</body>
</html>

When I try to parse only the elements inside the body tag with these commands:

$dom = new DOMDocument();

libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);

$full_dom = $dom->getElementsByTagName('body')->item(0);

The result of

$dom->saveHTML($full_dom)

is

<body>\n<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>\n<p>\u00a0<\/p>\n<table width=\"100%\"><\/table>\n<\/body>

Where does the element

<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>

come from? Everything else is fine; only this element gets transferred from the head tag into the body tag.

SubjectX

1 Answer

It comes from this line:

<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>

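It is malformed (the = after type is missing, and the closing tag is written as </script/>, so the stray /> ends up as text) and should be: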

<script type="text/javascript" src="/includes/js/video-js/video.js"></script>

To see what happened, check the errors after $dom->loadHTML(), while internal error handling is still enabled:

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
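
For context, here is a minimal sketch of where that loop fits in the whole load sequence. The sample $html is a hypothetical cut-down version of the page from the question, and $previous and $errors are names chosen for illustration. As far as I know, calling libxml_use_internal_errors(false) discards the buffered errors, so the sketch reads them first.

```php
<?php
// Hypothetical sample input reproducing the malformed <script> line from the question.
$html = '<!DOCTYPE html><html><head>'
      . '<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>'
      . '<link rel="stylesheet" href="/includes/js/video-js/video-js.css" />'
      . '</head><body><p>text</p></body></html>';

$dom = new DOMDocument();

// Buffer libxml errors instead of emitting PHP warnings; remember the old setting.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);

// Read the buffered errors before clearing them or restoring the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with level, code, line, column and message properties.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

Which messages are reported, and how many, depends on the libxml2 version that PHP was built against.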
Syscall
  • Gosh, I see. What are my options for ignoring such errors in a section of the HTML that I cannot control and do not want to touch? I only want to work on the body tag and ignore the rest. – SubjectX Jan 29 '18 at 07:40
  • @SubjectX I think you can't. But maybe you can try to strip the part of your string before the `` element before parsing it. Good luck. – Syscall Jan 29 '18 at 08:13
  • @SubjectX - Note that a conforming HTML5 parser will handle your malformed HTML much better, i.e. the same way that browsers do. There are several suggestions for such PHP libraries in the answers at https://stackoverflow.com/questions/10712503/how-to-make-html5-work-with-domdocument – Alohci Feb 01 '18 at 02:13
  • Thank you for the suggestion. I am working on some legacy code, so adding extra libraries is not a good option for now. – SubjectX Feb 01 '18 at 06:46
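
The idea from the comments of cutting the string down before parsing could be sketched roughly as follows, assuming the goal is to hand DOMDocument only the body markup. extractBodyMarkup() is a hypothetical helper, not part of any library, and the sample $html is a shortened stand-in for the page in the question. Encoding handling is left out for brevity.

```php
<?php
// Hypothetical helper: returns the <body ...>...</body> substring of $html,
// or null when no such pair is found.
function extractBodyMarkup(string $html): ?string
{
    // i = case-insensitive, s = dot also matches newlines.
    // The match is greedy, so it ends at the last </body> in the string.
    if (preg_match('~<body\b[^>]*>.*</body\s*>~is', $html, $m) === 1) {
        return $m[0];
    }
    return null;
}

// Shortened stand-in for the page in the question (malformed <script> in the head).
$html = '<!DOCTYPE html><html><head>'
      . '<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>'
      . '<link rel="stylesheet" href="/includes/js/video-js/video-js.css" />'
      . '</head><body style="font-family: arial;"><p>text</p>'
      . '<table width="100%"></table></body></html>';

$bodyMarkup = extractBodyMarkup($html);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Fall back to the full document when no body element was found.
$dom->loadHTML($bodyMarkup !== null ? $bodyMarkup : $html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$body = $dom->getElementsByTagName('body')->item(0);
echo $dom->saveHTML($body), "\n";
```

A regular expression is a blunt tool here: it would be fooled by a literal <body string inside a script or comment in the head, so treat this as a stopgap for legacy code, not a replacement for a tolerant parser.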