Chrome Extension - Get the true raw HTML document before JavaScript rendering

Question

I'm using a content script (loaded with run_at: document_start) to try to grab the exact source of the page before any JavaScript modifies the DOM.

I want the pure HTML - exactly what you'd get from Right Click > View Source in the browser.

I've tried two methods; both nearly work, but not quite.

Here's the actual raw source of the page, from Right Click > View Source:

<!doctype html>
<html lang="en">
<head>
    <title>Raw HTML title</title>
</head>

<body>

<p>Something here.</p>

<script>
    document.title = 'Title injected by JS';
</script>

</body>

</html>

Things I've tried:

new XMLSerializer().serializeToString(document)

This produces the following:

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en"><head>
    <title>Raw HTML title</title>
</head>

<body>

<p>Something here.</p>

<script>
    document.title = 'Title injected by JS';
</script></body></html>

It's close, but the output doesn't match the original: the whitespace differs, 'doctype' is capitalised, and an xmlns attribute has been added to the <html> tag.

document.documentElement.outerHTML

Produces the following:

<html lang="en"><head>
    <title>Raw HTML title</title>
</head>

<body>

<p>Something here.</p>

<script>
    document.title = 'Title injected by JS';
</script></body></html>



</body></html>

It's missing the doctype, and the formatting doesn't match the original either.

Likely a dupe of https://stackoverflow.com/questions/3024026/how-to-get-a-page-source-in-html — mplungjan, Jun 22 '18 at 13:24
Accessing via `document` is always going to give you the browsers interpretation of the HTML, post rendering. — Alex K., Jun 22 '18 at 13:24
Content scripts can't get an "exact source of the page before any DOM modifications take place from JavaScript" because when the browser loads a page it also *must* process inline scripts that modify the page whether you want it or not. You can get a server response, though, but you'll have to make a separate XHR/fetch request with the exact URL and cookies. — wOxxOm, Jun 22 '18 at 13:25
Possible duplicate of [How to get the static, original HTML source via JavaScript?](https://stackoverflow.com/questions/27157361/how-to-get-the-static-original-html-source-via-javascript) — wOxxOm, Jun 22 '18 at 13:44
@wOxxOm Both `new XMLSerializer().serializeToString(document)` and `document.documentElement.outerHTML` are doing exactly that though, note the title tag is the raw HTML title, before modification. The issue is the formatting is not the same, and some additional attributes have been added — Jon, Jun 22 '18 at 14:21
That may simply be a bug in the XMLSerializer API which isn't widely used, and is likely a legacy burden. It's not doing anything useful anyway: simply replace the script with `document.body.textContent = ''` and you'll see an empty body. The linked duplicate thread explains why that happens. In short: there's no such thing as unaltered page source, there's only DOM. — wOxxOm, Jun 22 '18 at 14:26

score 0 · Accepted Answer · answered Jun 26 '18 at 08:47

It doesn't seem possible to get the 'pure' HTML as seen in View Source. The closest you can get is the output of

new XMLSerializer().serializeToString(document)

If you run the above in a content script with run_at: document_start (before anything else exists in the DOM) and monitor for DOM mutations, you can serialize the document from the first mutation onwards with something like this:

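// Serialize the document each time the parser (or a script) changes the DOM.
var observer = new MutationObserver(function (mutations) {
    // One serialization per callback is enough: every record in the batch
    // refers to the same live document.
    var rawHTML = new XMLSerializer().serializeToString(document);
    console.log(rawHTML);
});

// Observe the whole document; subtree: true includes changes at any depth.
var config = { attributes: true, childList: true, characterData: true, subtree: true };
observer.observe(document, config);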
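
If the xmlns attribute is a problem, the document.documentElement.outerHTML attempt from the question can be combined with a doctype rebuilt from document.doctype. The following is only a sketch: serializeDoctype and serializeDocument are hypothetical helper names, not part of any API, and the result is still a serialization of the parsed DOM, so the original casing of 'doctype' and the original whitespace are not recovered.

```javascript
// Hypothetical helper: rebuild a doctype declaration from a DocumentType-like
// node (anything exposing name, publicId and systemId).
function serializeDoctype(doctype) {
    if (!doctype) {
        return ''; // the document had no doctype
    }
    var out = '<!DOCTYPE ' + doctype.name;
    if (doctype.publicId) {
        out += ' PUBLIC "' + doctype.publicId + '"';
    } else if (doctype.systemId) {
        out += ' SYSTEM';
    }
    if (doctype.systemId) {
        out += ' "' + doctype.systemId + '"';
    }
    return out + '>';
}

// Hypothetical helper: doctype (if any) followed by the root element's markup.
function serializeDocument(doc) {
    var doctype = serializeDoctype(doc.doctype);
    return (doctype ? doctype + '\n' : '') + doc.documentElement.outerHTML;
}
```

In a content script this would be called as serializeDocument(document).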

XMLSerializer() has solid browser support: https://caniuse.com/#feat=xml-serializer
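
As wOxxOm points out in the comments, the only way to see the markup exactly as the server sent it is to request the URL again. A minimal sketch of that idea follows. fetchRawSource is a hypothetical helper name, and the fetchImpl parameter exists only so the function can be exercised with a stub. The second response is not guaranteed to be identical to the one the browser originally rendered, for example on dynamic pages or pages reached via a form POST.

```javascript
// Hypothetical helper: re-request a URL and resolve with the response body as
// text, i.e. the markup before any parsing or script execution.
// fetchImpl defaults to the global fetch; pass a stub to use it elsewhere.
async function fetchRawSource(url, fetchImpl = globalThis.fetch) {
    const response = await fetchImpl(url, {
        credentials: 'include' // send the page's cookies with the request
    });
    if (!response.ok) {
        throw new Error('Request failed with status ' + response.status);
    }
    return response.text();
}
```

In a content script this could be used as fetchRawSource(location.href).then(function (html) { console.log(html); }).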
