
I am trying to figure out how to retrieve the full HTML page source (that is, all of it) from an <iframe> whose src is on the same origin as the page it is embedded in. I want the exact source code at any given time, which could be dynamic because JavaScript or PHP is generating the <iframe>'s HTML output. This means AJAX calls like $.get() will not work for me, since the page could have been modified via JavaScript or generated uniquely based on the request time or mt_rand() in PHP. So far I have not been able to retrieve the exact <!DOCTYPE> declaration from my <iframe>.

I have been experimenting around and searching through Stack Overflow and have not found a solution that retrieves all of the page source including the <!DOCTYPE> declaration.

One of the answers in How do I get the entire page's HTML with jQuery? suggests that in order to retrieve the <!DOCTYPE> information, you need to construct this declaration manually, by retrieving the <iframe>'s document.doctype property and then adding all of the attributes to the <!DOCTYPE> declaration yourself. Is this really the only way to retrieve this information from the <iframe>'s HTML page source?

Here are some notable Stack Overflow posts that I have looked through and that this is not a duplicate of:

Here is some of my local test code that illustrates my best attempt so far, which only retrieves the data within and including the <iframe>'s <html> tag:

main.html

<html>
<head>
  <title>Testing with iframe</title>
  <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
  <script type="text/javascript">
  function test() {
    var doc = document.getElementById('iframe-source').contentWindow.document;
    var html = $('html', doc).clone().wrap('<p>').parent().html();
    $('#output').val(html);
  }
  </script>
</head>
<body>

<textarea id="output"></textarea>
<iframe id="iframe-source" src="iframe.html" onload="javascript:test()"></iframe>

</body>
</html>


iframe.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html class="html-tag-class">
  <head class="head-tag-class">
    <title>iframe Testing</title>
  </head>
  <body class="body-tag-class">
    <h2>Testing header tag</h2>
    <p>This is <strong>very</strong> exciting</p>
  </body>
</html>


And here is a screenshot of these files run together in Google Chrome version 27.0.1453.110 m: [screenshot: iframe testing]

Summary

As you can see, Google Chrome's Inspect element shows that within the <iframe> the <!DOCTYPE> declaration is present, so how can I retrieve this data with the page source? This question also applies to any other declarations or other tags that are not contained within the <html> tags.


Any help or advice on retrieving this full page source code via Javascript would be greatly appreciated.

Aiias
  • "I want the exact source code at any given time" - seems like you have some misconceptions. "HTML source" is unchangeable - it is the HTML string served from the server (e.g. PHP). What is dynamic is the DOM (parsed HTML) which JS acts upon. `innerHTML`/`outerHTML` is nothing more than a serialization of the DOM. So, to summarize, you either send an Ajax request to the page and obtain the HTML source (the actual source before the JS executes), or, to get a serialization of the DOM, use the answer which you linked. – Fabrício Matté Jun 09 '13 at 04:57
  • @FabrícioMatté - Thanks for your response. The serialization of the `DOM` may not match the page source exactly, but I suppose manually constructing the `doctype` would be required in that case. – Aiias Jun 09 '13 at 04:58
  • How likely is the source going to change between requests? If you want the exact doctype string, you could use ajax to get the source, extract the doctype string, and then proceed with using the DOM changes. Depending upon how the html is being served from the webserver and how it is requested, it might only end up with one request and then always use cache (probably not optimal in your situation, though), or a `200 OK` and a `304 Not Modified` (or something similar; I'm pretty sure I have the HTTP codes right at least). – JayC Jun 09 '13 at 05:05
  • @JayC - In my use case the page source code would be different for every request since the source code is being modified via the UI. – Aiias Jun 09 '13 at 05:09
  • ?? So you're modifying html text, posting the modified html to the webserver, and then having the webserver send that back to you in the iframe? I guess I can understand why you might want that workflow, but it's quite unnecessary except maybe as a sanity check. Look at http://htmledit.squarefree.com to see what I mean. – JayC Jun 09 '13 at 05:15
  • @JayC - There are no webserver requests at the moment. All of it is done/updated with Javascript. What I am doing is very similar to the resource in your last comment. – Aiias Jun 09 '13 at 05:17
  • I think we just miscommunicated somehow, but I'm not sure how to rectify it. I was just suggesting a way to get the precise DOCTYPE used (like if it had extra spaces or something), ignore the following html, and then set up the iframe to the same url, and forevermore just use the iframe document's outerHTML to get the DOM's serialization. I'm doubtful "the precise DOCTYPE used" is even useful in your situation. – JayC Jun 09 '13 at 05:31
  • @JayC - Ensuring the `<!DOCTYPE>` declaration matches up is more of a consistency issue. Also, I am not certain if there are other types of declarations or other tags that would fall under the category of `not within the <html> tags`. – Aiias Jun 09 '13 at 05:33
  • Good question. I can't remember if comments, CTAGS, etc. are serialized. – JayC Jun 09 '13 at 05:39

1 Answer


Here is a way to build it from the document's doctype node. It seems to work for HTML 4 and HTML5; I didn't test it with other document types such as SVG.

<html>
<head>
  <title>Testing with iframe</title>
  <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
  <script type="text/javascript">
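  function test() {
    var d = document.getElementById('iframe-source').contentWindow.document;
    var t = d.doctype; // DocumentType node, or null if the page has no doctype
    var doctype = "";
    if (t) {
      // Rebuild the declaration from its parts; JSON.stringify adds the quotes.
      doctype = "<!DOCTYPE " + t.name +
        (t.publicId ? " PUBLIC " + JSON.stringify(t.publicId) : "") +
        (!t.publicId && t.systemId ? " SYSTEM" : "") +
        (t.systemId ? " " + JSON.stringify(t.systemId) : "") +
        ">\n";
    }
    $('#output').val(doctype + d.documentElement.outerHTML);
  }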
  </script>
</head>
<body>

<textarea id="output"></textarea>
<iframe id="iframe-source" src="iframe.html" onload="test()"></iframe>

</body>
</html>

This also uses documentElement.outerHTML, so any attributes on the <html> element itself are included in the output.
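The question also asks about other nodes that sit outside the <html> element, such as comments before or after it. As a rough sketch of how the same idea could be extended (not a definitive implementation), the code below walks the document's top-level child nodes and serializes each one. The helper names `doctypeToString` and `serializeDocument` are hypothetical, and the result is still a serialization of the current DOM, not the original source text, so whitespace and quoting inside the doctype may differ from what the server sent.

```javascript
// Hypothetical helper: rebuild a <!DOCTYPE ...> string from a DocumentType-like
// object that exposes name, publicId and systemId.
function doctypeToString(doctype) {
  if (!doctype) return '';
  var out = '<!DOCTYPE ' + doctype.name;
  if (doctype.publicId) {
    out += ' PUBLIC "' + doctype.publicId + '"';
    if (doctype.systemId) out += ' "' + doctype.systemId + '"';
  } else if (doctype.systemId) {
    // A system identifier without a public one needs the SYSTEM keyword.
    out += ' SYSTEM "' + doctype.systemId + '"';
  }
  return out + '>';
}

// Hypothetical helper: serialize every top-level node of a document,
// not only the <html> element.
function serializeDocument(doc) {
  var parts = [];
  Array.from(doc.childNodes).forEach(function (node) {
    if (node.nodeType === 10) {
      // DOCUMENT_TYPE_NODE
      parts.push(doctypeToString(node));
    } else if (node.nodeType === 8) {
      // COMMENT_NODE
      parts.push('<!--' + node.data + '-->');
    } else if (node.nodeType === 1) {
      // ELEMENT_NODE (the <html> element)
      parts.push(node.outerHTML);
    }
  });
  return parts.join('\n');
}
```

In the test page above, this could be called as `$('#output').val(serializeDocument(document.getElementById('iframe-source').contentWindow.document));` from the iframe's onload handler.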

dandavis