Firefox extensions & XUL: get page source code

Question

I am developing my first Firefox extension and for that I need to get the complete source code of the current page. How can I do that with XUL?

Lachlan Roche · Answer 1 · 2010-03-06T17:35:51.183

6

You will need a XUL browser object to load the content into.

Load the "view-source:" version of your page into the browser object, in the same way as the "View Page Source" menu does. See function viewSource() in chrome://global/content/viewSource.js. That function can load the page either from the cache or afresh.

Once the content is loaded, the original source is given by:

var source = browser.contentDocument.getElementById('viewsource').textContent;
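As a rough sketch of the steps above (this is not the actual viewSource.js code), the helper below assumes that browser is a XUL browser element exposing loadURI(), a capturing "load" event, and contentDocument. The names toViewSourceUrl and fetchPageSource are made up for illustration.

```javascript
// Build the "view-source:" URL for a page URL.
function toViewSourceUrl(pageUrl) {
  return "view-source:" + pageUrl;
}

// Load the view-source version of pageUrl into a XUL <browser> element
// (assumed API: loadURI, addEventListener, contentDocument) and hand the
// original source text to onSource once the load event fires.
function fetchPageSource(browser, pageUrl, onSource) {
  function onLoad() {
    // Only react to the first load event.
    browser.removeEventListener("load", onLoad, true);
    var body = browser.contentDocument.getElementById("viewsource");
    // Pass null if the expected element is missing.
    onSource(body ? body.textContent : null);
  }
  browser.addEventListener("load", onLoad, true);
  browser.loadURI(toViewSourceUrl(pageUrl));
}
```

In an overlay this might be called with a hidden browser element, for example fetchPageSource(document.getElementById('invisibleBrowser'), url, callback), where invisibleBrowser is a hypothetical element id.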

Serialize a DOM Document
This method will not get the original source, but may be useful to some readers.

You can serialize the document object to a string. See Serializing DOM trees to strings in the MDC. You may need to use the alternate method of instantiation in your extension.

That article talks about XML documents, but it also works on any HTML DOMDocument.

var serializer = new XMLSerializer();
var source = serializer.serializeToString(document);

This even works in a web page or the Firebug console.

edited Mar 06 '10 at 17:35

answered Mar 06 '10 at 14:34

Lachlan Roche


This looks pretty complete, too. What happens if the XHTML is broken due to some error, though? – Franz Mar 06 '10 at 15:14
The DOM parser will already have dealt with broken HTML, so the serializer will not see the broken source. – Lachlan Roche Mar 06 '10 at 15:35
That would probably be bad then? Does the `document` variable have the property `textContent`, too? – Franz Mar 06 '10 at 16:58
Your edit looks veeery interesting. If this works out, this should be it. – Franz Mar 06 '10 at 17:54
Haven't yet had time to check. I will do so, though, before the bounty runs out. Don't worry ;) – Franz Mar 08 '10 at 17:07
I feel really stupid now. I can't get the browser class to work. How can I create that kind of object? – Franz Mar 08 '10 at 18:03
"view source" creates it via XUL, see 'chrome://global/content/viewSource.xul' – Lachlan Roche Mar 08 '10 at 22:14
1

I'm experimenting with this solution and it seems to be working perfectly so far! Thank you Lachlan! @Franz I would think that creating a new one (`document.createElement('browser')`) should work, but you can also just put it in your main overlay XUL: ` ` and then of course, in your js file: `var browser = document.getElementById('invisibleBrowser')` – Tyler Jun 25 '10 at 22:40

score 2 · Accepted Answer · answered Mar 02 '10 at 14:45

2

It really looks like there is no way to get all of the source code. You may use

document.documentElement.innerHTML

to get the innerHTML of the top element (usually html). If you have a PHP error message printed before the document, like

<h3>fatal error</h3>
segfault

<html>
    <head>
        <title>bla</title>
        <script type="text/javascript">
            alert(document.documentElement.innerHTML);
        </script>
    </head>
    <body>
    </body>
</html>

the innerHTML would be

<head>
<title>bla</title></head><body><h3>fatal error</h3>
segfault    
        <script type="text/javascript">
            alert(document.documentElement.innerHTML);
        </script></body>

but the error message would still be retained.

edit: documentElement is described here: https://developer.mozilla.org/en/DOM/document.documentElement

answered Mar 02 '10 at 14:45

Phil Rykoff


This might be what I'm looking for. However, I don't understand the example code you posted. Is the second block supposed to be the text printed via `alert` in the first block? If so, why would the error message suddenly appear inside the `body` tag? – Franz Mar 02 '10 at 20:37
Yep, the second code block was the code being alerted. That's probably Firefox's code correction. Just copy the first block into an empty HTML file and try it out :-) – Phil Rykoff Mar 03 '10 at 00:10
This is not the complete source. As you noted, everything that's not between `<html>` and `</html>` doesn't get included. Lachlan's answer seems to be a much better solution. – Tyler Jun 25 '10 at 22:42

Sagi · Answer 3 · 2010-03-05T22:36:58.990

2

You can get the URL with var URL = document.location.href and then navigate to "view-source:" + URL.

Now you can fetch the whole source code (viewsource is the id of the body):

var code = document.getElementById('viewsource').innerHTML;

The problem is that the source code is formatted as syntax-highlighted HTML, so you have to run the equivalent of PHP's strip_tags() and htmlspecialchars_decode() to fix it.

For example, line 1 should be the doctype and line 2 should look like:

&lt;<span class="start-tag">HTML</span>&gt;

So after strip_tags() it becomes:

&lt;HTML&gt;

And after htmlspecialchars_decode() we finally get expected result:

<HTML>

The code is not passed through the DOM parser, so you can view invalid HTML too.
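strip_tags() and htmlspecialchars_decode() are PHP functions, so inside an extension the same two steps would have to be done in JavaScript. Here is a minimal sketch, assuming the markup contains only highlighting tags such as the span shown above and only the basic entities. stripTags and decodeEntities are hypothetical helper names, not existing APIs.

```javascript
// Remove highlighting markup such as <span class="start-tag">...</span>.
// Literal "<" characters from the page are escaped as &lt; in the
// view-source markup, so this regex only matches the highlighting tags.
function stripTags(markup) {
  return markup.replace(/<[^>]*>/g, "");
}

// Decode the basic HTML entities. "&amp;" is decoded last so that an
// escaped entity like "&amp;lt;" becomes "&lt;" and not "<".
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#0*39;/g, "'")
    .replace(/&amp;/g, "&");
}

// Combine both steps to recover the original source text.
function viewSourceMarkupToText(markup) {
  return decodeEntities(stripTags(markup));
}
```

For the example line above, viewSourceMarkupToText('&lt;<span class="start-tag">HTML</span>&gt;') gives back <HTML>.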

edited Mar 05 '10 at 22:36

answered Mar 05 '10 at 14:16

Sagi


Hmmm... sounds pretty good. Is the entire code wrapped in an element with ID `viewsource` or why are you doing it that way? And what do you mean by "formatted"? Are the entities escaped? – Franz Mar 05 '10 at 21:47
Think of it as normal HTML code. The body id is viewsource. I've added an example of how it looks. I hope that you have some idea how to get to this page (you can do it with a hidden iframe, for example). – Sagi Mar 05 '10 at 22:33
Or you could just use `.textContent` instead. – Eli Grey Mar 05 '10 at 23:57
1

Franz: You don't need all of that. Just use `document.getElementById('viewsource').textContent` – Eli Grey Mar 06 '10 at 14:43
@Eli Grey - Thanks. I verified and it works. However, comments are stripped. – Sagi Mar 06 '10 at 14:52
I'll post it as an answer then that you can choose. – Eli Grey Mar 06 '10 at 16:48

Manuel Bitto · Answer 4 · 2010-03-01T14:41:03.543

1

Maybe you can get it via DOM, using

var source = document.getElementsByTagName("html");

and fetch the source using DOMParser

https://developer.mozilla.org/En/DOMParser

edited Mar 01 '10 at 14:41

answered Mar 01 '10 at 13:36

Manuel Bitto


getElementsByTagName (note: elements) – N 1.1 Mar 01 '10 at 14:01

score 0 · Answer 5 · answered Mar 06 '10 at 16:49

0

The first part of Sagi's answer, but use document.getElementById('viewsource').textContent instead.

answered Mar 06 '10 at 16:49

Eli Grey


score 0 · Answer 6 · answered Apr 12 '10 at 10:22

0

This is more in line with Lachlan's answer: there is a discussion of the internals here that gets quite in depth, going into the C++ code.

http://www.mail-archive.com/mozilla-embedding@mozilla.org/msg05391.html

and then follow the replies at the bottom.

answered Apr 12 '10 at 10:22

Daniel Gerson

