1

I'm looking for a way to read the source code of a page after it has finished loading, and to inspect that code to see whether it contains a specific text.

I found this reference, but it only returns the text visible on the page, not the whole HTML code.

For instance, if the html source code is:

<html>
<head>
</head>
<body>
<p>This is a paragraph</p>
</body>
</html>

I want the script to print exactly the same thing.

Your help is appreciated.

gen_Eric
Leo S.
  • Can you elaborate what you mean by "print"? – Patrick Roberts Jan 22 '16 at 20:49
  • Right click, View Page Source? :) – nem035 Jan 22 '16 at 20:49
  • What are you looking for in the source code exactly? Why do you want to "inspect the code" versus using jQuery to traverse the DOM? – gen_Eric Jan 22 '16 at 20:49
  • You could take the innerHTML property of the tag, as proposed in your link. – ssc-hrep3 Jan 22 '16 at 20:50
  • You can get the page markup using document.documentElement.innerHTML (Source: http://stackoverflow.com/questions/817218/how-to-get-the-entire-document-html-as-a-string) – Shashank Karam Jan 22 '16 at 20:50
  • @ShashankReddyKaram good link but based on OP's reference, it seems like he wants the markup from an XMLHttpRequest rather than from the current document. – Patrick Roberts Jan 22 '16 at 20:52
  • Possible duplicate of [How to print HTML content on click of a button, but not the page?](http://stackoverflow.com/questions/16894683/how-to-print-html-content-on-click-of-a-button-but-not-the-page) – Asons Jan 22 '16 at 20:54
  • Sorry for the confusion about the word "print". What I want to achieve is the same result that the "right click > inspect element" would give. What I'm trying to do is: 1) Open URL 2) Wait for the page to load 3) Check if page contains an iframe 4) Display a message if the iframe is found – Leo S. Jan 22 '16 at 21:07
  • @LeoS.: Why not just do something like `document.getElementsByTagName('iframe')` (or `$('iframe')`)? To do this after the page loads, you can use `window.addEventListener('load', function() {})` (or `$(function(){})`). – gen_Eric Jan 22 '16 at 21:09
  • Possible duplicate of [Best Way to View Generated Source of Webpage?](https://stackoverflow.com/questions/1750865/best-way-to-view-generated-source-of-webpage) – Asons Mar 11 '18 at 10:12

4 Answers

0

You can do it like this: call the following function once the page has loaded.

Fiddle Demo

function printBody() {
  // store original content
  var originalContents = document.body.innerHTML;

  // get the outer html of the document element
  document.body.innerText = document.documentElement.outerHTML;

  // call window.print if you want it on paper
  window.print();

  // or put it into an iframe
  // var ifr = document.createElement('iframe');
  // ifr.src = 'data:text/plain;charset=utf-8,' + encodeURIComponent(document.documentElement.outerHTML);
  // document.body.appendChild(ifr);

  // a small delay is needed so window.print does not get the original
  setTimeout(function(){
    document.body.innerHTML = originalContents;
  }, 2000);
}

Src: Print <div id=printarea></div> only?

Community
Asons
0

Assuming that by 'print' you don't actually mean to transfer it to a paper copy, you can add some script like:

window.addEventListener('load', function() {
    var content = document.documentElement.innerHTML,
        pre = document.createElement('pre'),
        body = document.body;

    pre.innerText = content;

    body.insertBefore(pre, body.firstChild);
});

What this does, step by step is:

  • window.addEventListener('load', function() > Wait for the page to be fully loaded and then execute the function
  • content = document.documentElement.innerHTML > store the actual page source in the content variable (document.documentElement refers to the root node, usually <html> in HTML documents)
  • pre = document.createElement('pre') > create a new <pre>-element
  • body = document.body > create a reference to the <body> element
  • pre.innerText = content > assign the HTML-structure we've stored earlier as text to the <pre>-element
  • body.insertBefore(pre, body.firstChild) > put the <pre>-element (now with contents) before any other element in the body (usually on top of the page).

This leaves you with the entire source (as it was before creating the <pre>-element containing the source) on top of your page.
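
If the goal is only to check whether the markup contains a specific text, as the question describes, the source does not have to be displayed at all. Below is a minimal sketch of that idea. The helper name `sourceContains` and the search string `'<iframe'` are made up for illustration. Keep in mind that `outerHTML` is the browser's serialization of the current DOM, so it can differ from the exact text the server sent.

```javascript
// Hypothetical helper: returns true if the serialized markup of `doc`
// contains `needle`. `doc` is any object shaped like a Document.
function sourceContains(doc, needle) {
    // outerHTML includes the <html> tag itself, unlike innerHTML
    return doc.documentElement.outerHTML.indexOf(needle) !== -1;
}

// In a browser, run the check once the page has fully loaded.
if (typeof window !== 'undefined') {
    window.addEventListener('load', function () {
        // '<iframe' is only an example search string
        console.log(sourceContains(document, '<iframe'));
    });
}
```

A plain substring search also matches text inside comments or scripts, so for questions like "is there an iframe?" a DOM lookup is usually the more reliable check.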


Edit: Added <iframe> workflow

It was not clear to me that you actually wanted to target an <iframe>, so here's how to do that (using a naive approach, more on that further on):

window.addEventListener('load', function() {
    var iframeList = document.getElementsByTagName('iframe'),
        body = document.body,
        content, pre, i;

    for (i = 0; i < iframeList.length; ++i) {
        content = iframeList[i].contentDocument.documentElement.innerHTML;
        pre = document.createElement('pre');

        pre.innerText = content;
        body.insertBefore(pre, body.firstChild);
    }
});

Why is this approach naive?

Browsers enforce the Same-Origin Policy, which prevents your JavaScript from accessing <iframe> content if that content does not come from the same origin (scheme, host and port) as the page containing the <iframe>.

There are several ways to take this into consideration, you could wrap the inside of the for-loop in try/catch-blocks, though I prefer to use a more subtle approach by not even considering <iframes> which do not match the Same-Origin-Policy.
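
For comparison, the try/catch variant mentioned above could look roughly like this. It is only a sketch, and `readFrameSource` is a made-up helper name. In current browsers `contentDocument` is typically `null` for a frame the page may not access, while other access paths (such as `contentWindow.document`) throw a `SecurityError`, so the sketch guards against both.

```javascript
// Hypothetical helper: returns the frame's markup, or null when the
// frame's document is not accessible (for example, cross-origin).
function readFrameSource(iframe) {
    try {
        var doc = iframe.contentDocument;
        // contentDocument is null for frames the page may not access
        return doc ? doc.documentElement.innerHTML : null;
    } catch (e) {
        // some access paths throw (e.g. a SecurityError) instead
        return null;
    }
}

if (typeof window !== 'undefined') {
    window.addEventListener('load', function () {
        var frames = document.getElementsByTagName('iframe'), i, source;

        for (i = 0; i < frames.length; ++i) {
            source = readFrameSource(frames[i]);
            console.log(source === null ? 'frame ' + i + ' is not readable' : source);
        }
    });
}
```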

In order to do this, you can swap the getElementsByTagName method with the querySelectorAll method (please note the compatibility table at the bottom of that page, see if it matches your requirements). The querySelectorAll accepts a valid CSS selector and will return a NodeList containing all matching elements.

A simple selector to use would be 'iframe[src]:not([src^="//"]):not([src^="http"])', which selects every iframe whose src attribute does not start with either // or http.

Disclaimer: I never use a <base>-tag (which changes all relative paths within the HTML) or refer to the current website using a path containing the domain, so the example CSS-selector does not consider these aberrations.
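
If absolute URLs pointing at your own site do need to be handled, one alternative to the CSS selector is to compare origins with the `URL` constructor. This is a sketch and not part of the approach above: `isSameOrigin` is a made-up helper name, and `URL` is not available in Internet Explorer. It only inspects the `src` value, so a frame that redirects elsewhere or is sandboxed can still turn out to be inaccessible.

```javascript
// Hypothetical helper: true when `src`, resolved against `pageUrl`,
// has the same origin (scheme, host and port) as `pageUrl`.
function isSameOrigin(src, pageUrl) {
    try {
        return new URL(src, pageUrl).origin === new URL(pageUrl).origin;
    } catch (e) {
        // unparsable URL: treat it as not same-origin
        return false;
    }
}

if (typeof window !== 'undefined') {
    window.addEventListener('load', function () {
        var frames = document.querySelectorAll('iframe[src]'), i;

        for (i = 0; i < frames.length; ++i) {
            // the src property is already resolved to an absolute URL
            if (isSameOrigin(frames[i].src, window.location.href)) {
                console.log('same-origin frame: ' + frames[i].src);
            }
        }
    });
}
```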

Can you use :not()

IE9 or better

Can you use document.querySelector(All)

IE8 or better (in order to use with :not(), IE9 or better)


Rogier Spieker
  • Hi Rogier, your script works well. I made a small mod to search for the iframe as suggested by @Rocket Hazmat However, how do I search for another element inside that iframe now? Currently, my code looks like this: ` window.addEventListener('load', function() { var content = document.getElementsByTagName('iframe').contentDocument.documentElement.innerHTML, pre = document.createElement('pre'), body = document.body; pre.innerText = content; body.insertBefore(pre, body.firstChild); }); ` – Leo S. Jan 22 '16 at 21:52
  • You're on the right track, there is just a small mistake: `document.getElementsByTagName('iframe')` will return a `NodeList` containing zero or more elements, it does not return a single element. – Rogier Spieker Jan 23 '16 at 11:18
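
To make the point from the last comment concrete, here is a rough sketch of indexing into the returned collection and then searching inside the frame. `findInFirstFrame` is a made-up helper name and `'p'` is only an example selector. It assumes the frame is same-origin and has finished loading.

```javascript
// Hypothetical helper: looks up `selector` inside the first <iframe>
// of `doc`. Returns null if there is no frame, the frame's document
// is not accessible, or nothing matches.
function findInFirstFrame(doc, selector) {
    var frames = doc.getElementsByTagName('iframe');

    if (frames.length === 0) {
        return null;
    }

    // index into the collection to get a single element
    var frameDoc = frames[0].contentDocument;

    return frameDoc ? frameDoc.querySelector(selector) : null;
}

if (typeof window !== 'undefined') {
    window.addEventListener('load', function () {
        // 'p' is only an example selector
        console.log(findInFirstFrame(document, 'p'));
    });
}
```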
0

I think you are over-complicating this problem. You don't need to "print" the page's HTML or "inspect the code".

In a comment, you said:

Check if page contains an iframe [and] Display a message if the iframe is found

You can just use DOM traversal functions to examine the DOM.

Try something like this:

window.addEventListener('load', function() {
    if(document.getElementsByTagName('iframe').length){
        console.log('Found an iframe');
    }
});

Or with jQuery:

$(function() {
    if($('iframe').length){
        console.log('Found an iframe');
    }
});
gen_Eric
0

This is simple: use window.onload to run a script after the page has fully loaded.

function load(){
    console.log(document.getElementsByTagName('html')[0].innerHTML);
}
window.onload = load;

For further explanations, check this post

Community