0

Hi, this may be a silly question, but I can't find the answer anywhere. I'm writing a Chrome extension, and all I need is to read in the HTML of the current page so I can extract some data from it.

Here's what I have so far:

<script>
    window.addEventListener("load", windowLoaded, false);
    function windowLoaded() {
        alert(document.innerHTML)
      });
    }
</script>

Can anybody tell me what I'm doing wrong? Thanks,

Richard Mosse

4 Answers

2
function windowLoaded() {
    alert('<html>' + document.documentElement.innerHTML + '</html>');
}
addEventListener("load", windowLoaded, false);

Notice how windowLoaded is created before it is used, not after, which won't work.

Also notice how I am getting the innerHTML of document.documentElement, which is the html tag, then adding the html source tags around it.
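
Wrapping the markup in literal `<html>` tags drops any attributes on the root element. If those matter, one option is to rebuild the opening tag from the element's attributes. This is only a minimal sketch: `openingTag` and `documentMarkup` are hypothetical helper names, not part of any API, and the attribute escaping is deliberately basic.

```javascript
// Hypothetical helper: rebuild an element's opening tag, attributes included.
function openingTag(el) {
  const name = el.tagName.toLowerCase();
  // el.attributes is array-like; each entry has a name and a value.
  const attrs = Array.from(el.attributes)
    .map((a) => {
      const value = a.value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
      return " " + a.name + '="' + value + '"';
    })
    .join("");
  return "<" + name + attrs + ">";
}

// Hypothetical helper: root element markup with its attributes preserved.
function documentMarkup(doc) {
  const root = doc.documentElement;
  const name = root.tagName.toLowerCase();
  return openingTag(root) + root.innerHTML + "</" + name + ">";
}
```

In a content script, `alert(documentMarkup(document))` should then show the page's markup with the root attributes intact.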

Delan Azabani
  • You're not completely right about the order. http://jsfiddle.net/pimvdb/vuuFS/. By the way, this will discard `<html>` attributes. – pimvdb Aug 13 '11 at 13:56
  • That's weird; I don't understand how that is possible. Also, with respect to the `html` tag's attributes, it's possible to get the attributes, but only do so if they're needed; it's laborious and slow. – Delan Azabani Aug 13 '11 at 13:57
  • Perhaps you might want to have a look at http://stackoverflow.com/questions/336859/javascript-var-functionname-function-vs-function-functionname – pimvdb Aug 13 '11 at 13:58
  • 1
    That's completely new to me; I learn something new every day. Thanks! – Delan Azabani Aug 13 '11 at 13:59
  • 2
    Hi, this is almost exactly what I need, except it loads the wrong page: it loads the HTML of the extension file. I'm trying to load the HTML of the current tab. Is this possible? – Richard Mosse Aug 13 '11 at 16:00
  • Sorry, never mind; I forgot it needed to be in a content script. Thanks. – Richard Mosse Aug 13 '11 at 16:49
2

I'm writing a chrome extension, all I need is to read in the html of the current page so I can extract some data from it.

I think the important question here is not how to alert the innerHTML correctly, but how to get the data you need from what has already been rendered.

As pimvdb pointed out, your code isn't working because of a syntax typo and because it needs document.documentElement.innerHTML; you can diagnose both in the Chrome console (Ctrl+Shift+I). But that's secondary to why you'd want the inner HTML in the first place. Whether you're looking for a certain node, specific text, the number of <div> elements, the value of an ID, etc., I'd strongly recommend a library like jQuery (vanilla JS works, but it can be verbose and unwieldy). Instead of reading in all the HTML and parsing it with string functions or regex, you probably want to take advantage of the DOM functionality already available to you.

In other words, something like this:

$("#some_id").val();                      // jQuery
document.getElementById("some_id").value; // vanilla JS

is probably way safer, easier and more readable than something eminently breakable like this (probably a bit off here, but just to make a point):

innerHTML.match(/<[^>]+id="some_id"[^>]+value="(.*?)"[^>]*?>/i)[1];
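
To make the comparison a little more concrete, here is a small sketch of pulling a few values through DOM methods instead of string matching. The function name `extractData`, the id `some_id` and the returned field names are placeholders for illustration only; passing the document in as a parameter is just a convenience so the function can be tried against a stub.

```javascript
// Hypothetical helper: gather a few facts from a document via DOM methods.
function extractData(doc) {
  const field = doc.getElementById("some_id"); // null if no such element
  return {
    value: field ? field.value : null,             // value of the form field
    divCount: doc.querySelectorAll("div").length,  // how many <div> elements exist
    title: doc.title,                              // text of the <title> element
  };
}
```

In a content script this would be called as `extractData(document)`.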
brymck
1
window.addEventListener("load", windowLoaded, false);

function windowLoaded() {
    alert(document.documentElement.innerHTML);
}

You had a } with no purpose, and the }); should just be }. These are syntax errors.

Also, it's document.documentElement.innerHTML, since innerHTML is not a property of document.

pimvdb
  • This gets the HTML inside `body` only. Also, this still won't work anyway; the function definition of `windowLoaded` should be written before it is used. – Delan Azabani Aug 13 '11 at 13:53
  • @Delan Azabani: You're correct, thanks. I have to disagree though with your second point because `function ...() {}` definitions are read before the parsing of statements. – pimvdb Aug 13 '11 at 13:54
  • @Delan This is only true if a function is declared in the form `var f = function() { ... }` rather than `function f() { ... }`. The latter is [hoisted](http://elegantcode.com/2011/03/24/basic-javascript-part-12-function-hoisting/) to the top such that it can be used by lines _before_ it appears in the code proper. It's not the best practice, but it does work. – brymck Aug 13 '11 at 13:58
  • More surprising to me is that `var x = y;` is partially 'hoisted' as well such that it makes outer-scoped variables of the same name unavailable even to statements before the variable declaration. – Delan Azabani Aug 13 '11 at 14:02
  • @Delan Azabani: There are statements and expressions. Statements are parsed before expressions, so e.g. http://jsfiddle.net/pimvdb/jDSRw/1/ will not work as you might think. – pimvdb Aug 13 '11 at 14:07
  • I meant that the name is hoisted, like in your example's equivalent code, not the name and value. It has the effect of 'blocking' a variable of the same name from an outer scope. – Delan Azabani Aug 13 '11 at 14:10
  • @pimvdb I believe @Delan is referring to things like `var a=0;function b(){alert(a);var a=1;}b();` not working (it alerts `undefined`), while it works just fine if you delete the `var a=1;` part. The fact that `a` _exists_ in `b` is hoisted, but it remains `undefined` until assigned a value. This is part of why it's always a good idea to declare variables at the top of their scope, of course; it will save headaches later. – brymck Aug 13 '11 at 14:26
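
The hoisting behaviour discussed in these comments can be checked with a short script. This is just an illustrative sketch; all of the names in it (`declared`, `expressed`, `shadowed`, and so on) are made up for the demonstration.

```javascript
// A function declaration is hoisted together with its body,
// so it can be called before the line where it appears.
const early = declared();

function declared() {
  return "declared";
}

// With `var f = function ...`, only the name is hoisted;
// the variable exists here but has not been assigned yet.
const typeBefore = typeof expressed;
var expressed = function () {
  return "expressed";
};

// A local `var a` is hoisted to the top of the function, so it hides
// the outer `a` even for statements that come before the declaration.
var a = 0;
function shadowed() {
  const seen = a; // reads the local, still-unassigned `a`
  var a = 1;
  return seen;
}
```

Here `early` ends up as `"declared"`, `typeBefore` as `"undefined"`, and `shadowed()` returns `undefined` rather than `0`.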
1

Use document.documentElement.outerHTML. (Note that Firefox does not support it at the time of writing, which is irrelevant in your case.) However, this is still not perfect, as it doesn't return nodes outside the root element (the doctype and possibly some comments or processing instructions). The document.innerHTML property is, AFAIK, specified in the HTML5 specification, but it is currently not supported in any browser.
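
If the doctype matters, one way to recover it is to serialize it by hand and prepend it to the root element's outerHTML. The following is a sketch under that assumption: `doctypeToString` and `fullMarkup` are hypothetical helper names, and comments or processing instructions outside the root element are still left out.

```javascript
// Hypothetical helper: turn a DocumentType node into its markup form.
function doctypeToString(doctype) {
  if (!doctype) return ""; // documents without a doctype
  let s = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    s += ' PUBLIC "' + doctype.publicId + '"';
  } else if (doctype.systemId) {
    s += " SYSTEM";
  }
  if (doctype.systemId) {
    s += ' "' + doctype.systemId + '"';
  }
  return s + ">";
}

// Hypothetical helper: doctype (if any) followed by the root element's markup.
function fullMarkup(doc) {
  const dt = doctypeToString(doc.doctype);
  return (dt ? dt + "\n" : "") + doc.documentElement.outerHTML;
}
```

In a content script, `fullMarkup(document)` would be the call to try.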

Just FYI, navigating to view-source:www.example.com also displays the entire markup (Chrome & Firefox). But I don't know whether you can work with it somehow.

duri