How to get the entire document HTML as a string?

Question

Is there a way in JS to get the entire HTML within the html tags, as a string?

document.documentElement.??

The only correct answer: http://stackoverflow.com/questions/817218/how-to-get-the-entire-document-html-as-a-string#answer-35917295 (**stop up-voting inner/outerHTML answers, they do NOT provide the entire source!**) — John, Dec 31 '16 at 03:54
@bluejayke the doctype and tag itself are not included in innerHTML, and the doctype is not present in outerHTML. See paulo62’s answer; it gives the output of both — Pixelated Fish, Feb 04 '21 at 21:02
Op did not ask for the entire source, please calm down John. — Seth Jeffery, Mar 27 '22 at 17:34
**Stop upvoting John's bolded comment!** The answer he links to replaces `&&` with `&amp;&amp;` and so it breaks all your inline `<script>` elements — joe, Oct 31 '22 at 03:58

score 399 · Accepted Answer · edited Jun 19 '23 at 15:19

Get the root <html> element with document.documentElement then get its .innerHTML:

const txt = document.documentElement.innerHTML;
alert(txt);

or its .outerHTML to get the <html> tag as well

const txt = document.documentElement.outerHTML;
alert(txt);

edited Jun 19 '23 at 15:19

Boris Verkhovskiy

answered May 03 '09 at 14:37

Colin Burnett

outerHTML doesn't get the doctype. – CMCDragonkai Apr 10 '14 at 02:50
Worked like a charm, thank you! Is there any way to get the size of any/all files linked to the document as well, including JS and CSS files? – www139 Mar 17 '15 at 02:54
@CMCDragonkai: You could [get the doctype separately](http://stackoverflow.com/a/10162353/157385) and prepend it to the markup string. Not ideal, I know, but possible. – Mike Branski Nov 19 '15 at 19:58
Note that neither this nor any of the other answers necessarily gives you content that is the exact hash equivalent of saving the page to a file or of the file generated by view-source. It seems the DOM normalizes some fields from the literal response content, such as capitalising DOCTYPE headers. – wesinat0r Jul 07 '20 at 00:51

score 140 · Answer 2 · edited Jun 19 '23 at 15:15

You can do

new XMLSerializer().serializeToString(document)

in browsers newer than IE 9

See https://caniuse.com/xml-serializer
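
One way to hedge against environments that lack XMLSerializer is to feature-detect it and fall back to outerHTML. The sketch below is only an illustration: `serializeWithFallback` and its `SerializerCtor` parameter are hypothetical names, not part of any API, and the fallback branch returns the markup without the doctype. As the comments on this answer point out, XMLSerializer emits XML rather than HTML syntax, so characters such as `&` inside inline scripts come back escaped.

```javascript
// Hypothetical helper: serialize a Document, preferring XMLSerializer.
// `doc` is any Document-like object; `SerializerCtor` defaults to the
// global XMLSerializer when the environment provides one.
function serializeWithFallback(doc, SerializerCtor = globalThis.XMLSerializer) {
  if (typeof SerializerCtor === 'function') {
    // Includes the doctype, but the output follows XML escaping rules.
    return new SerializerCtor().serializeToString(doc);
  }
  // Fallback: the <html> element only, without the doctype.
  return doc.documentElement.outerHTML;
}

// In a browser:
// console.log(serializeWithFallback(document).slice(0, 200));
```

Passing the constructor as a parameter is a design choice made here so the function can be exercised outside a browser with a stand-in object.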

edited Jun 19 '23 at 15:15

Boris Verkhovskiy

answered Mar 10 '16 at 13:01

This was the *first* **correct answer** according to date/time stamps. Parts of the page such as the XML declaration will *not* be included and browsers will manipulate the code when using the other "answers". This is the *only* post that should be up-voted (dos's posted three days later). People need to pay attention! – John Dec 31 '16 at 03:57
This is not entirely correct, since serializeToString performs an HTML encode. For example, if your code contains styles defining fonts such as "Times New Roman", Times, serif, the quotes will get HTML-encoded. Perhaps that is not important to some of you, but to me it is... – Marko Jun 06 '17 at 20:43
@John well the OP actually asks for "the entire HTML _within_ the html tags". And the selected best answer by Colin Burnett does achieve this. This particular answer (Erik's) will include the html tags and the doctype. That said, this was totally a diamond in the rough for me and exactly what I was looking for! Your comment helped too because it made me spend more time with this answer, so thanks :) – evanrmurphy Oct 26 '17 at 22:46
I think people should be careful with this one, specifically because it returns a value that is not the actual html that your browser receives. In my case, it added attributes to the `html` tag that the server never actually sent :( – onassar Dec 23 '18 at 19:58
For some reason this fails in Edge – kiranvj Jan 09 '19 at 11:02
I used this but had no success getting the whole HTML source of my page... :( – shalin gajjar May 17 '19 at 13:56
Very poor browser support unfortunately https://developer.mozilla.org/en-US/docs/Web/API/XMLSerializer – Max Mumford May 17 '19 at 16:14
It's supported in every browser. How is this poor browser support? – May 17 '19 at 19:26
@onassar If your page's JavaScript modifies the document, it won't be the same as what the server sent. Or do you experience this even if your page has no JavaScript? – trusktr Sep 16 '21 at 01:14
**WARNING:** This breaks `…` – joe Oct 31 '22 at 03:46
This converts things like `…` in html into `…` in xml... & adds things like `…` – Nor.Z Feb 18 '23 at 14:30
HTML 5 documents are not valid XML, and passing the former to `serializeToString` will have a number of potentially undesired artefacts, like every `=>` (a syntactical part of an arrow function definition in a `script` element) turning out like `=&gt;` in the serialized string, which is rejected by subsequent attempts to parse and load said document text back in the browser (e.g. for displaying it as a page). – Armen Michaeli Apr 16 '23 at 17:29

score 54 · Answer 3 · edited May 23 '17 at 12:26

I tried the various answers to see what is returned. I'm using the latest version of Chrome.

The suggestion document.documentElement.innerHTML; returned <head> ... </body>

Gaby's suggestion document.getElementsByTagName('html')[0].innerHTML; returned the same.

The suggestion document.documentElement.outerHTML; returned <html><head> ... </body></html> which is everything apart from the 'doctype'.

You can retrieve the doctype with document.doctype. This returns an object, not a string, so if you need to extract the details as strings for all doctypes up to and including HTML5, that is described here: Get DocType of an HTML as string with Javascript

I only wanted HTML5, so the following was enough for me to create the whole document:

alert('<!DOCTYPE HTML>' + '\n' + document.documentElement.outerHTML);
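
The same idea can be wrapped in a small function that reads the doctype name from the document instead of hard-coding it. This is a sketch under assumptions: `documentToHtmlString` is a hypothetical name, and it only handles HTML5-style doctypes, ignoring any public or system identifiers an older doctype might carry.

```javascript
// Hypothetical helper: doctype (if any) plus the <html> element's markup.
// `doc` is a Document-like object, e.g. the global `document` in a browser.
function documentToHtmlString(doc) {
  // doc.doctype is null when the page has no doctype declaration.
  const doctype = doc.doctype ? '<!DOCTYPE ' + doc.doctype.name + '>\n' : '';
  return doctype + doc.documentElement.outerHTML;
}

// In a browser:
// console.log(documentToHtmlString(document));
```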

This is the most complete answer and should be accepted. As of 2016, browser compatibility is complete, and mentioning it in detail (as in the currently accepted answer) is no longer necessary. — Dan Dascalescu, Feb 17 '16 at 19:45

score 50 · Answer 4 · edited Dec 01 '12 at 01:42

I believe document.documentElement.outerHTML should return that for you.

According to MDN, outerHTML is supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile. outerHTML is in the DOM Parsing and Serialization specification.

The MSDN page on the outerHTML property notes that it is supported in IE 5+. Colin's answer links to the W3C quirksmode page, which offers a good comparison of cross-browser compatibility (for other DOM features too).

edited Dec 01 '12 at 01:42

XP1

answered May 03 '09 at 14:36

Noldorin

Not all browsers support this. – Colin Burnett May 03 '09 at 14:38
@Colin: Yeah, good point. From experience, I seem to remember that both IE 6+ and Firefox support it, though the quirksmode page you linked suggests otherwise... – Noldorin May 03 '09 at 14:42
Firefox does not support OuterHTML. It is IE proprietary. https://developer.mozilla.org/En/Migrate_apps_from_Internet_Explorer_to_Mozilla#Generate_and_manipulate_content – Jesse Dearing May 03 '09 at 14:53
@Jesse: Yes, evidently. Might have been innerHTML that I used, which does have cross-browser support. – Noldorin May 03 '09 at 14:53
Is there a way to get everything including the doctype and the html tags? – trusktr Apr 10 '12 at 21:36
This answer is nearly identical to Colin's (and refers to it). Merge maybe? – Dan Dascalescu Feb 17 '16 at 19:43
Mine was first, actually. :P – Noldorin Feb 17 '16 at 19:43
@Noldorin, Why do you quote MSDN for? – Pacerier Oct 16 '17 at 15:19
@Pacerier: Why *did* I quote it, you mean? This was over 8 years ago haha! But the truth is, I think I just wanted to highlight IE support. And since IE is Microsoft-produced, MSDN seemed like a good authority on it... – Noldorin Oct 16 '17 at 16:25

score 10 · Answer 5 · answered Jun 16 '11 at 14:04

You can also do:

document.getElementsByTagName('html')[0].innerHTML

You will not get the Doctype or html tag, but everything else...

answered Jun 16 '11 at 14:04

Hakan

score 7 · Answer 6 · answered May 03 '09 at 14:36

document.documentElement.outerHTML

answered May 03 '09 at 14:36

Brian Campbell

Not all browsers support this. – Colin Burnett May 03 '09 at 14:38
Supported in Firefox 11, Chrome 0.2, Internet Explorer 4.0, Opera 7, Safari 1.3, Android, Firefox Mobile 11, IE Mobile, Opera Mobile, and Safari Mobile ([MDN](https://developer.mozilla.org/en-US/docs/DOM/element.outerHTML)). `outerHTML` is in the [DOM Parsing and Serialization](http://domparsing.spec.whatwg.org/#outerhtml) specification. – XP1 Dec 01 '12 at 01:33
Colin's answer is more detailed. – Dan Dascalescu Feb 17 '16 at 19:43

score 6 · Answer 7 · edited Aug 28 '20 at 22:46

Probably IE only:

>     webBrowser1.DocumentText

For Firefox 1.0 and later, the following may work:

// Serialize the current DOM tree, including any changes/edits, into ss
var ns = new XMLSerializer();
var ss = ns.serializeToString(document);
alert(ss.slice(0, 300));

(This shows the first 300 characters from the very beginning of the source text, mostly the doctype definitions.)

Be aware, though, that Firefox's normal "Save As" dialog might not save the current state of the page, but rather the originally loaded (X)HTML source text. (POSTing ss to some temporary file and redirecting to it might deliver a saveable source text that includes the changes/edits previously made to it.)

That said, Firefox is surprisingly good at restoring state on "back", and "Save (as) ..." nicely includes the states/values of input-like fields, textareas, etc., though not of elements in contenteditable/designMode.

If the document is not an XHTML or XML file (judged by MIME type, not just the filename extension), you can use document.open/write/close to set the appropriate content on the source layer, which is what gets saved by the user's save dialog from Firefox's File/Save menu. See http://www.w3.org/MarkUp/2004/xhtml-faq#docwrite and https://developer.mozilla.org/en-US/docs/Web/API/document.write

Independent of the X(HT)ML question, you can try a "view-source:http://..." URL as the value of the src attribute of a (possibly script-created) iframe, and then access the iframe's document in Firefox via:

<iframe-elementnode>.contentDocument

Search for "mdn contentDocument" for the appropriate members, such as textContent. I worked this out years ago and would rather not dig for it again; if there is still an urgent need, mention it and I will dive back in.

Gerben · Answer 8 · 2018-11-25T21:35:19.783

To also get things outside the <html>...</html>, most importantly the <!DOCTYPE ...> declaration, you could walk through document.childNodes, turning each into a string:

const html = [...document.childNodes]
    .map(node => nodeToString(node))
    .join('\n') // could use '' instead, but whitespace should not matter.

function nodeToString(node) {
    switch (node.nodeType) {
        case node.ELEMENT_NODE:
            return node.outerHTML
        case node.TEXT_NODE:
            // Text nodes should probably never be encountered, but handling them anyway.
            return node.textContent
        case node.COMMENT_NODE:
            return `<!--${node.textContent}-->`
        case node.DOCUMENT_TYPE_NODE:
            return doctypeToString(node)
        default:
            throw new TypeError(`Unexpected node type: ${node.nodeType}`)
    }
}

I published this code as document-outerhtml on npm.

Edit: Note that the code above depends on a function doctypeToString; its implementation could be as follows (the code below is published on npm as doctype-to-string):

function doctypeToString(doctype) {
    if (doctype === null) {
        return ''
    }
    // Checking with instanceof DocumentType might be neater, but how to get a
    // reference to DocumentType without assuming it to be available globally?
    // To play nice with custom DOM implementations, we resort to duck-typing.
    if (!doctype
        || doctype.nodeType !== doctype.DOCUMENT_TYPE_NODE
        || typeof doctype.name !== 'string'
        || typeof doctype.publicId !== 'string'
        || typeof doctype.systemId !== 'string'
    ) {
        throw new TypeError('Expected a DocumentType')
    }
    const doctypeString = `<!DOCTYPE ${doctype.name}`
        + (doctype.publicId ? ` PUBLIC "${doctype.publicId}"` : '')
        + (doctype.systemId
            ? (doctype.publicId ? `` : ` SYSTEM`) + ` "${doctype.systemId}"`
            : ``)
        + `>`
    return doctypeString
}

score 3 · Answer 9 · answered May 03 '09 at 14:37

document.documentElement.innerHTML

answered May 03 '09 at 14:37

cherouvim

This doesn't return the `<html>` tag. – Dan Dascalescu Feb 17 '16 at 19:44

score 1 · Answer 10 · answered Mar 31 '11 at 23:43

I always use

document.getElementsByTagName('html')[0].innerHTML

Probably not the right way but I can understand it when I see it.

answered Mar 31 '11 at 23:43

gaby de wilde

This is incorrect because it won't return the `<html>` tag. – Dan Dascalescu Feb 17 '16 at 19:47

score 1 · Answer 11 · answered Jun 09 '20 at 12:24

I am using outerHTML for elements (the main <html> container), and XMLSerializer for anything else including <!DOCTYPE>, random comments outside the <html> container, or whatever else might be there. It seems that whitespace isn't preserved outside the <html> element, so I'm adding newlines by default with sep="\n".

function get_document_html(sep="\n") {
    let html = "";
    let xml = new XMLSerializer();
    for (let n of document.childNodes) {
        if (n.nodeType == Node.ELEMENT_NODE)
            html += n.outerHTML + sep;
        else
            html += xml.serializeToString(n) + sep;
    }
    return html;
}

console.log(get_document_html().slice(0, 200));

Username Name · Answer 12 · 2021-06-02T18:09:43.337

This would work if you want everything except the DOCTYPE:

document.getElementsByTagName('html')[0].outerHTML;

or this if you want the doctype too:

new XMLSerializer().serializeToString(document.doctype) + document.getElementsByTagName('html')[0].outerHTML;

edited Jun 02 '21 at 18:09

answered May 26 '21 at 11:18

Username Name

score 1 · Answer 13 · answered Jun 04 '23 at 05:54

Using querySelector

const html = document.querySelector("html").outerHTML;
console.log(html)

answered Jun 04 '23 at 05:54

mplungjan

kiranvj · Answer 14 · 2019-01-09T11:14:19.403

I only needed the HTML5 doctype, and it had to work in IE11, Edge and Chrome. The code below works fine for me.

function downloadPage(element, event) {
    var isChrome = /Chrome/.test(navigator.userAgent) && /Google Inc/.test(navigator.vendor);

    if ((navigator.userAgent.indexOf("MSIE") != -1) || (!!document.documentMode == true)) {
        document.execCommand('SaveAs', '1', 'page.html');
        event.preventDefault();
    } else {
        if(isChrome) {
            element.setAttribute('href','data:text/html;charset=UTF-8,'+encodeURIComponent('<!doctype html>' + document.documentElement.outerHTML));
        }
        element.setAttribute('download', 'page.html');
    }
}

Then use it in your anchor tag like this:

<a href="#" onclick="downloadPage(this,event);" download>Download entire page.</a>
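
The Chrome branch above boils down to building a data: URI from the serialized markup. A minimal sketch of just that step follows; `buildHtmlDataUri` is a hypothetical helper name, and the input string is assumed to already contain the doctype followed by document.documentElement.outerHTML.

```javascript
// Hypothetical helper: turn an HTML string into a data: URI that can be
// assigned to the href of an anchor carrying a download attribute.
function buildHtmlDataUri(html) {
  // encodeURIComponent escapes characters such as '<', '#' and spaces,
  // which would otherwise be misread inside a URI.
  return 'data:text/html;charset=UTF-8,' + encodeURIComponent(html);
}

// In a browser (element is the clicked anchor):
// element.setAttribute('href', buildHtmlDataUri('<!doctype html>' + document.documentElement.outerHTML));
```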


score -1 · Answer 15 · edited May 23 '17 at 12:18

Use document.documentElement.

Same Question answered here: https://stackoverflow.com/a/7289396/2164160

edited May 23 '17 at 12:18

Community

answered May 06 '15 at 07:10

Veer En

That question should be closed as pretty much a duplicate of this one, which is much older. Anyway, the interesting part is that you need `.outerHTML` and to get the `document.doctype`, and the most complete answer is [Paolo's](http://stackoverflow.com/a/26905999/1269037). – Dan Dascalescu Feb 17 '16 at 19:53

score -3 · Answer 16 · edited Feb 05 '19 at 22:04

You have to iterate through the document's childNodes and get the outerHTML of each node.

In VBA it looks like this:

For Each e In document.ChildNodes
    Put ff, , e.outerHTML & vbCrLf
Next e

This gets you all elements of the web page, including the <!DOCTYPE> node if it exists.

edited Feb 05 '19 at 22:04

Eric Aya

answered Feb 05 '19 at 21:58

milevyo

score -10 · Answer 17 · answered Oct 29 '10 at 15:05

The correct way is actually:

webBrowser1.DocumentText

answered Oct 29 '10 at 15:05

Damiano
