How do I get the original innerHTML source without the Javascript generated contents?

Question

Is there some way to get the original HTML source, without the changes made by JavaScript that has already run? For example, if I have:

<div id="test">
    <script type="text/javascript">document.write("hello");</script>
</div>

If I do:

alert(document.getElementById('test').innerHTML);

it shows:

<script type="text/javascript">document.write("hello");</script>hello

In simple terms, I would like the alert to show only:

<script type="text/javascript">document.write("hello");</script>

without the final hello (the result of the processed script).

In which browser did you test this? In FF4b7 and Chrome 8 I get `hello` — Marcel Korpel, Dec 09 '10 at 11:23
@Marcel: I updated the question, I forgot a piece. Sorry for that. — Marco Demaio, Dec 09 '10 at 13:08
And I fear you don't know in advance what text is added, do you? — Marcel Korpel, Dec 09 '10 at 13:12
@Marcel: what do you mean? The text added in the example is `hello` coz it's created by the `document.write("hello")`. I'm looking for a general purpose solution not dependent on the code inside the DIV, something that returns always the original source code without the modifications made by the Javascript engine. — Marco Demaio, Dec 09 '10 at 13:24
Yeah, that's what I feared. But when elements are added to the DOM, there's no way to distinguish between original markup and dynamically added elements/nodes (unless you mark them as such), at least not as far as I know. — Marcel Korpel, Dec 09 '10 at 13:38
Why do you need to do this? I'm sure there's a workaround to whatever you're trying to do if you tell us what that is. — Sasha Chedygov, Dec 16 '10 at 02:42
@musicfreak: let's say you have a simple CMS, where the innerHTML of the DIVs on your page can be changed by the end user using JavaScript, and then when he saves the page the innerHTML contents of each DIV are sent to the server to be stored in the DB. When the innerHTML contains — Marco Demaio, Dec 18 '10 at 18:32
It's a bit of a hack but why not just download the current url using AJAX? You should get the original source with a couple of caveats (POST data would be ignored and anything random or time-dependent might be different) — Basic, Jul 10 '14 at 17:50

David Tang · Answer 1 · 2010-12-16T03:02:15.840

I don't think there's a simple solution to just "grab original source" as it'll have to be something that's supplied by the browser. But, if you are only interested in doing this for a section of the page, then I have a workaround for you.

You can wrap the section of interest inside a "frozen" script:

<script id="frozen" type="text/x-frozen-html">

I just made up the type attribute, but because the browser doesn't recognise it as a script type, it will ignore everything inside the element. You then add another script tag (proper JavaScript this time) immediately after this one: the "thawing" script. This thawing script gets the frozen script by ID, grabs the text inside it, and calls document.write to add the actual contents to the page. Whenever you need the original source, it's still captured as text inside the frozen script.

And there you have it. The downside is that I wouldn't use this for the whole page... (SEO, syntax highlighting, performance...) but it's quite acceptable if you have a special requirement on part of a page.

Edit: Here is some sample code. Also, as @FlashXSFX correctly pointed out, any script tags within the frozen script need to be escaped, because a literal closing script tag would end the frozen script early. So in this simple example, I'll make up an <x-script> tag for this purpose.

<script id="frozen" type="text/x-frozen-html">
   <div id="test">
      <x-script type="text/javascript">document.write("hello");</x-script>
   </div>
</script>
<script type="text/javascript">
   // Grab contents of frozen script and replace `x-script` with `script`
   function getSource() {
      return document.getElementById("frozen")
         .innerHTML.replace(/x-script/gi, "script");
   }
   // Write it to the document so it actually executes
   document.write(getSource());
</script>

Now whenever you need the source:

alert(getSource());

See the demo: http://jsbin.com/uyica3/edit

Could you plz show a short piece of code. I don't understand. — Marco Demaio, Dec 10 '10 at 11:30
I thought that this might actually work, so I gave it a go. The main problem I saw was when you are trying to put script tags inside the frozen tag. (I used the original poster's snippets) You will need to do some escaping and some string replacing to get that to work. — FlashXSFX, Dec 10 '10 at 19:27

score 4 · Answer 2 · edited May 23 '17 at 12:13

A simple way is to fetch it from the server again. It will most probably be in the cache. Here is my solution using jQuery.get(). It takes the original URI of the page and loads the data with an Ajax call:

$.get(document.location.href, function(data,status,jq) {console.log(data);})

This prints the original source as the server sent it, before any JavaScript has run. Note that it does not do any error handling!

If you don't want to use jQuery to fetch the source, consult the answer to this question: How to make an ajax call without jquery?
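
As a rough illustration of the same idea without jQuery, here is a minimal sketch that uses the standard fetch API and adds the error handling the one-liner above leaves out. The names fetchOriginalSource and fetchImpl are made up for this example; fetchImpl is only there so a different fetch function can be swapped in, and it falls back to the global fetch.

```javascript
// Sketch: re-request a page and resolve with the raw response text.
// `fetchOriginalSource` and `fetchImpl` are hypothetical names, not an existing API.
function fetchOriginalSource(url, fetchImpl) {
  // Fall back to the global fetch when no replacement is supplied.
  var doFetch = fetchImpl || fetch;
  return doFetch(url).then(function (response) {
    // Treat HTTP error statuses (404, 500, ...) as failures.
    if (!response.ok) {
      throw new Error("Request failed with status " + response.status);
    }
    return response.text();
  });
}

// Browser usage:
// fetchOriginalSource(document.location.href)
//   .then(function (source) { console.log(source); })
//   .catch(function (err) { console.error(err); });
```

The same caveats apply as with the jQuery version: the response may come from the cache, and a page that is generated dynamically may not come back identical to the one that was originally loaded.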

Michael_Scharf · answered Aug 08 '14 at 00:03

Excellent idea! I had an issue where scraping a site without a web browser was impossible, but at the same time the site was destroying some data (which I needed) after loading up. With this approach, the slow and inefficient loading is done once, whereas the actual reading of the site html is very efficiently done from the same browser session, so it solves two problems at once. – Christian Apr 18 '20 at 01:37

score 2 · Answer 3 · answered Dec 16 '10 at 02:41

Could you send an Ajax request to the same page you're currently on and use the result as your original HTML? This is foolproof given the right conditions, since you are literally getting the original HTML document. However, this won't work if the page changes on every request (with dynamic content), or if, for whatever reason, you cannot make a request to that specific page.
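
Once the page source has been re-fetched as a string, one possible way to get back just the markup of a single element is to parse that string with DOMParser, which builds a separate document whose scripts are not executed. The sketch below assumes the source text is already available; originalInnerHTML and its optional parser argument are names invented for this example.

```javascript
// Sketch: extract one element's markup from a re-fetched source string.
// `originalInnerHTML` and the optional `parser` argument are hypothetical names.
function originalInnerHTML(sourceText, id, parser) {
  // Use the browser's DOMParser unless another parser object is passed in.
  var domParser = parser || new DOMParser();
  // Scripts in a document created by DOMParser are parsed but never run,
  // so document.write calls inside them add nothing to the result.
  var doc = domParser.parseFromString(sourceText, "text/html");
  var element = doc.getElementById(id);
  return element ? element.innerHTML : null;
}

// Browser usage, given `source` holding the re-fetched page text:
// alert(originalInnerHTML(source, "test"));
```

Note that the result is the parser's serialization of the markup, so it can differ in small ways from the bytes the server sent (for example in attribute quoting or where the original markup was invalid).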

Jules · Answer 4 · 2010-12-16T02:30:58.020

Brute force approach

var orig = document.getElementById("test").innerHTML;
alert(orig.replace(/<\/script>[.\n\r]*.*/i,"</script>"));

EDIT:

This could be better

var orig = document.getElementById("test").innerHTML + "<<>>";
// [\s\S]* matches any characters, including newlines, up to the appended marker
alert(orig.replace(/<\/script>[\s\S]*<<>>$/i, "<\/script>"));

answered Dec 10 '10 at 05:45

Besides the fact that you forgot a slash `replace(/<\/script>[.\n\r]*.*/i,"<\/script>")` and that I don't understand why you placed a dot inside the `[.\n\r]`, it might still be a good attempt and a possible approach, so +1. It's still very specific, though: if I add a simple newline, as in `document.write("hello\nchina");`, your regex would remove only `hello` and leave `china` where it is. – Marco Demaio Dec 10 '10 at 11:23
@Marco, thanks for correcting the regex. As I said it is a brute force approach (not an elegant/generic one). – Jules Dec 14 '10 at 09:53

score 0 · Answer 5 · answered Dec 10 '10 at 21:04

If you override document.write to add some identifiers at the beginning and end of everything written to the document by the script, you will be able to remove those writes with a regular expression.

Here's what I came up with:

    <script type="text/javascript" language="javascript">
        var docWrite = document.write;
        document.write = myDocWrite;

        function myDocWrite(wrt) {
            docWrite.apply(document, ['<!--docwrite-->' + wrt + '<!--/docwrite-->']);
        }
    </script>

Added your example somewhere in the page after the initial script:

    <div id="test">
        <script type="text/javascript">     document.write("hello");</script>
    </div>

Then I used this to alert what was inside:

    // [\s\S] also matches newlines, so multi-line writes are removed too
    var regEx = /<!--docwrite-->[\s\S]*?<!--\/docwrite-->/g;
    alert(document.getElementById('test').innerHTML.replace(regEx, ''));

Please be more specific. Original post was asking how to use document.write, and still get the original source. — FlashXSFX, Nov 30 '11 at 19:55

score 0 · Answer 6 · answered Dec 10 '10 at 21:39

If you want the pristine document, you'll need to fetch it again. There's no way around that. If it weren't for the document.write() (or similar code that would run during the load process) you could load the original document's innerHTML into memory on load/domready, before you modify it.
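
For the case described above where no script writes into the page during loading, keeping a copy in memory could look roughly like the following sketch. snapshotInnerHTML is a hypothetical helper, and doc can be any object with a getElementById method (normally document). It does not help with the document.write example from the question, because that output is already part of the DOM by the time this code can run.

```javascript
// Sketch: remember each element's markup before later scripts change it.
// `snapshotInnerHTML` is a hypothetical helper, not an existing API.
function snapshotInnerHTML(doc, ids) {
  var snapshot = {};
  ids.forEach(function (id) {
    var element = doc.getElementById(id);
    // Skip ids that are not present in the document.
    if (element) {
      snapshot[id] = element.innerHTML;
    }
  });
  return snapshot;
}

// Browser usage: take the snapshot once the DOM is ready, before any edits.
// document.addEventListener("DOMContentLoaded", function () {
//   window.originalSources = snapshotInnerHTML(document, ["test"]);
// });
```

The stored strings are plain copies, so later changes to the elements leave them untouched.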

score 0 · Answer 7 · answered Dec 10 '10 at 22:00

I can't think of a solution that would work the way you're asking. The only code that Javascript has access to is via the DOM, which only contains the result after the page has been processed.

The closest I can think of to achieve what you want is to use Ajax to download a fresh copy of the raw HTML for your page into a Javascript string, at which point since it's a string you can do whatever you like with it, including displaying it in an alert box.

score 0 · Answer 8 · answered Jan 02 '18 at 02:21

A trickier way is to use a <style> tag as the template container, so that you no longer need to rename script to x-script.

console.log(document.getElementById('test').innerHTML);

<style id="test" type="text/html+template">
    <script type="text/javascript">document.write("hello");</script>
</style>

But I do not like this ugly solution.

score -1 · Answer 9 · answered Dec 09 '10 at 11:34

I think you want to traverse the DOM nodes:

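// Wrapped in a function so that the return statement is valid.
// Returns the source text of every <script> that is a direct child of the element.
function getScriptSources(id) {
    var childNodes = document.getElementById(id).childNodes, i, output = [];

    for (i = 0; i < childNodes.length; i++)
        if (childNodes[i].nodeName == "SCRIPT")
            output.push(childNodes[i].innerHTML);

    return output.join('');
}

alert(getScriptSources('test'));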
