
I am trying to download a site for offline viewing, which requires me to do a number of DOM manipulations (trust me, wget just is not doing what I need it to do...).

I am finding that webpages containing tags with unusual text content are throwing saveHTML off.

For a given URL, if I use curl to read the page and output it with

echo $contents;

then all is well.

For instance, there is a section of the page containing the following source:

<div id="area2516" class="component interaction_component float-none clear-none ">
    <div id="area2516">
        <script type="text/javascript">
            window.bm = window.bm || {};
            bm.data = bm.data || [];
            bm.data['area2516'] = {};
        </script>

        <link rel="stylesheet" type="text/css" href="/somecss.css">
        <script type="text/javascript" src="somejs.js">
        </script>

    <script class="main-template" type="text/x-handlebars-template">
            <div class="content_area">
                <div class="bg_image cf"></div>
                    {{#each rollovers}}
                <div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
                    {{{this.content}}}
                </div>
                {{/each}}
                </div>
                <div class="rollover_links">
                    <ul>
                        {{#each rollovers}}
                        <li>
                            <a class="rollover_link" href="#" data-rollover-id="{{id}}">
                                {{{link}}}
                            </a>
                        </li>
                        {{/each}}
                    </ul>
                </div>
        </script>


        <script type="text/javascript">
            bm.data['area2516'].assets = {};
            bm.data['area2516'].initial_json = '';
        </script>

as seen from the above echo following the curl response.

Now, if I do this

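$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of silencing them with @
$doc->loadHTML($contents);
$xpath = new DOMXPath($doc); // used later for the DOM manipulations
echo $doc->saveHTML();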

the HTML gets messed up, such that the section above now becomes this:

<div id="area2516" class="component interaction_component float-none clear-none ">
<div id="area2516">
    <script type="text/javascript">
        window.bm = window.bm || {};
        bm.data = bm.data || [];
        bm.data['area2516'] = {};
    </script>
    <link rel="stylesheet" type="text/css" href="/somecss.css"> . 
    <script type="text/javascript" src="/somejs.js"></script>
    <script class="main-template" type="text/x-handlebars-template">
        <div class="content_area">
            <div class="bg_image cf">
    </script>
            </div>
            {{#each rollovers}}
            <div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
              {{{this.content}}}
            </div>
          {{/each}}
        </div>
        <div class="rollover_links">
          <ul>
            {{#each rollovers}}
              <li>
                <a class="rollover_link" href="#" data-rollover-id="{{id}}">
                  {{{link}}}
                </a>
              </li>
            {{/each}}
          </ul></div>
<script type="text/javascript">
        bm.data['area2516'].assets = {};
        bm.data['area2516'].initial_json = '';
      </script>

Sorry about the formatting, this new editor is pretty annoying. The point is, there are some pretty major differences (the template script is closed early, right after the first inner div), and I am not sure why saveHTML modifies the source this way. I suspected it had something to do with encoding and the peculiarity of the double and triple braces used by the templating system, but despite trying various encoding parameters, I get the same result. Then I thought maybe it has something to do with special characters or escaping, but I am just not sure what function(s) are needed to stop saveHTML from messing up the output.

Ideas?

Thanks

miken32
Brian
  • Looks like a SPA JS site, possibly running Vue; I think I've seen them using a Mustache syntax in templates. Fortunately that template engine has been ported to basically everything, you'll find it here: https://mustache.github.io/ – Christoffer Bubach Dec 22 '19 at 20:00

2 Answers


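Per the HTML 4 specification you can't put arbitrary text into a <script> element: its content ends at the first "</" sequence, which is why your template gets cut off at the first </div>. (Although this is possible in HTML 5, the libxml parser included with PHP is not that new.)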
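To check whether a given PHP build is affected, something like the following could be used. This is only a minimal sketch: the helper name `scriptSurvivesRoundTrip` is made up for illustration, and the result depends on the libxml2 version PHP was built against, so no particular output is claimed here.

```php
<?php
// Hypothetical helper: returns true if the parser kept the whole template
// inside the <script> element, false if it cut the element short.
function scriptSurvivesRoundTrip(string $template): bool
{
    $html = '<div><script type="text/x-handlebars-template">' . $template . '</script></div>';

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
    libxml_clear_errors();

    // Compare what ended up inside the first <script> with what was put in.
    $script = $doc->getElementsByTagName('script')->item(0);
    return $script !== null && $script->textContent === $template;
}

// A template containing a closing tag, like the one in the question.
$affected = !scriptSurvivesRoundTrip('<div class="bg_image cf"></div>{{#each rollovers}}{{/each}}');
echo $affected ? "script content was truncated\n" : "script content survived\n";
```

If the check reports truncation, the markup inside the template has to be hidden from the parser in some way before calling loadHTML.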

If you properly escape the contents of the element, your code should work as expected.

$content = <<< HTML
<div id="area2516" class="component interaction_component float-none clear-none ">
    <div id="area2516">
        <script type="text/javascript">
            window.bm = window.bm || {};
            bm.data = bm.data || [];
            bm.data['area2516'] = {};
        </script>

        <link rel="stylesheet" type="text/css" href="/somecss.css">
        <script type="text/javascript" src="somejs.js">
        </script>

    <script class="main-template" type="text/x-handlebars-template">
            &lt;div class="content_area"&gt;
                &lt;div class="bg_image cf"&gt;&lt;/div&gt;
                    {{#each rollovers}}
                &lt;div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}"&gt;
                    {{{this.content}}}
                &lt;/div&gt;
                {{/each}}
                &lt;/div&gt;
                &lt;div class="rollover_links"&gt;
                    &lt;ul&gt;
                        {{#each rollovers}}
                        &lt;li&gt;
                            &lt;a class="rollover_link" href="#" data-rollover-id="{{id}}"&gt;
                                {{{link}}}
                            &lt;/a&gt;
                        &lt;/li&gt;
                        {{/each}}
                    &lt;/ul&gt;
                &lt;/div&gt;
        </script>


        <script type="text/javascript">
            bm.data['area2516'].assets = {};
            bm.data['area2516'].initial_json = '';
        </script>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($content, LIBXML_HTML_NODEFDTD|LIBXML_HTML_NOIMPLIED);
echo $doc->saveHTML();

The output is as expected:

<div id="area2516" class="component interaction_component float-none clear-none ">
    <div id="area2516">
        <script type="text/javascript">
            window.bm = window.bm || {};
            bm.data = bm.data || [];
            bm.data['area2516'] = {};
        </script>

        <link rel="stylesheet" type="text/css" href="/somecss.css">
        <script type="text/javascript" src="somejs.js">
        </script>

    <script class="main-template" type="text/x-handlebars-template">
            &lt;div class="content_area"&gt;
                &lt;div class="bg_image cf"&gt;&lt;/div&gt;
                    {{#each rollovers}}
                &lt;div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}"&gt;
                    {{{this.content}}}
                &lt;/div&gt;
                {{/each}}
                &lt;/div&gt;
                &lt;div class="rollover_links"&gt;
                    &lt;ul&gt;
                        {{#each rollovers}}
                        &lt;li&gt;
                            &lt;a class="rollover_link" href="#" data-rollover-id="{{id}}"&gt;
                                {{{link}}}
                            &lt;/a&gt;
                        &lt;/li&gt;
                        {{/each}}
                    &lt;/ul&gt;
                &lt;/div&gt;
        </script>


        <script type="text/javascript">
            bm.data['area2516'].assets = {};
            bm.data['area2516'].initial_json = '';
        </script></div></div>

Note that your HTML is invalid in other ways: repeated id attributes and missing closing tags.
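When the page comes from curl, escaping the template by hand is not practical. One alternative, which is a different technique from the escaping shown above, is to swap the body of each template script for a placeholder before parsing and put it back after saveHTML. The sketch below assumes the templates are always marked with type="text/x-handlebars-template" as in the question; the function names `stashTemplates` and `restoreTemplates` are hypothetical, and a regex is only a rough way to find the script blocks.

```php
<?php
// Hypothetical helper: replace the body of every Handlebars template <script>
// with a token so the HTML parser never sees the markup inside it.
function stashTemplates(string $html, array &$stash): string
{
    $pattern = '~(<script\b[^>]*\btype=["\']text/x-handlebars-template["\'][^>]*>)(.*?)(</script>)~is';

    $result = preg_replace_callback($pattern, function (array $m) use (&$stash): string {
        $token = '__TEMPLATE_' . count($stash) . '__';
        $stash[$token] = $m[2];          // remember the original template body
        return $m[1] . $token . $m[3];   // keep the tags, swap only the body
    }, $html);

    return $result ?? $html;             // fall back to the input if the regex fails
}

// Hypothetical helper: put the original template bodies back.
function restoreTemplates(string $html, array $stash): string
{
    return strtr($html, $stash);
}

// Shortened stand-in for the page source from the question.
$contents = '<div><script class="main-template" type="text/x-handlebars-template">'
    . '<div class="bg_image cf"></div>{{#each rollovers}}<a href="#">{{{link}}}</a>{{/each}}'
    . '</script></div>';

$stash = [];
$safe  = stashTemplates($contents, $stash);

$doc = new DOMDocument();
libxml_use_internal_errors(true);        // keep parser warnings out of the output
$doc->loadHTML($safe, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
libxml_clear_errors();

// ... modify src and href attributes on $doc here ...

$result = restoreTemplates($doc->saveHTML(), $stash);
echo $result;
```

The templates are restored byte for byte, so the browser still receives them unescaped, which matters if the page's own JavaScript compiles them at load time.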

miken32
  • Not sure how to escape the characters this way for the script's inner content only, but I came across this: https://stackoverflow.com/questions/4029341/dom-parser-that-allows-html5-style-in-script-tag and it seems to be working. I need to test to confirm once I have the rest of the page ready to check. Thanks – Brian Dec 21 '18 at 14:46

The input does not even look like HTML; it looks like a Twig (or similar) template,

which would need to be pushed through a template engine first in order to get HTML output.

Without passing the (array) $rollovers, this will certainly not yield the desired results.

If these aren't your own template files, you might be downloading the wrong URL,

and someone on the other side has forgotten to prevent access to the templates.

Martin Zeitler
  • Maybe it has something to do with loadHTML vs loadXML? The template system is javascript Handlebars from what I can tell. I think the site will load locally for me if I can just get the DOMDocument tools to leave everything untouched except for the src and href attributes that I am modifying. thanks, Brian – Brian Dec 21 '18 at 12:56