I am trying to download a site for offline viewing and this is requiring me to do a number of DOM manipulations (trust me, wget just is not doing what I need to do...).
I am finding that webpages containing tags with unusual text content is throwing saveHTML off.
For some url, if I use curl to read the page and output as
echo $contents;
then all is well.
For instance, there is a section of the page containing the following source:
<div id="area2516" class="component interaction_component float-none clear-none ">
<div id="area2516">
<script type="text/javascript">
window.bm = window.bm || {};
bm.data = bm.data || [];
bm.data['area2516'] = {};
</script>
<link rel="stylesheet" type="text/css" href="/somecss.css">
<script type="text/javascript" src="somejs.js">
</script>
<script class="main-template" type="text/x-handlebars-template">
<div class="content_area">
<div class="bg_image cf"></div>
{{#each rollovers}}
<div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
{{{this.content}}}
</div>
{{/each}}
</div>
<div class="rollover_links">
<ul>
{{#each rollovers}}
<li>
<a class="rollover_link" href="#" data-rollover-id="{{id}}">
{{{link}}}
</a>
</li>
{{/each}}
</ul>
</div>
</script>
<script type="text/javascript">
bm.data['area2516'].assets = {};
bm.data['area2516'].initial_json = '';
</script>
as seen from the above echo following the curl response.
Now, if I do this
$doc = new DOMDocument();
@$doc->loadHTML($contents);
$xpath = new DOMXpath($doc);
echo $doc->saveHTML();
the HTML gets messed up, such that above now becomes this:
<div id="area2516" class="component interaction_component float-none clear-none ">
<div id="area2516">
<script type="text/javascript">
window.bm = window.bm || {};
bm.data = bm.data || [];
bm.data['area2516'] = {};
</script>
<link rel="stylesheet" type="text/css" href="/somecss.css"> .
<script type="text/javascript" src="/somejs.js"></script>
<script class="main-template" type="text/x-handlebars-template">
<div class="content_area">
<div class="bg_image cf">
</script>
</div>
{{#each rollovers}}
<div class="rollover_content" style="left: {{x}}; top: {{y}}; display: none;" data-rollover-id="{{id}}">
{{{this.content}}}
</div>
{{/each}}
</div>
<div class="rollover_links">
<ul>
{{#each rollovers}}
<li>
<a class="rollover_link" href="#" data-rollover-id="{{id}}">
{{{link}}}
</a>
</li>
{{/each}}
</ul></div>
<script type="text/javascript">
bm.data['area2516'].assets = {};
bm.data['area2516'].initial_json = '';
</script>
Sorry about the formatting, this new editor is pretty annoying. The point is, you can see some pretty major differences, and I am not sure how saveHTML is causing this modification to the source. I suspect it had something to do with encoding and the peculiarity of these double and triple braces used by the templating system, but despite attempts to use various encoding parameters, I am getting the same result. Then I thought maybe has something to do with special chars, escaping, but I am just not sure what function(s) are needed to stop saveHTML from messing up the output.
Ideas?
Thanks