How do I get the HTML source from the page?

Question

Is there a way to access the page's HTML source code using JavaScript?

I know that I can use document.body.innerHTML, but it contains only the code inside the body. I want to get the entire page source, including the head and body tags with their content and, if possible, the html tag and the doctype. Is it possible?

Possible duplicate of [How to get the entire document HTML as a string?](https://stackoverflow.com/questions/817218/how-to-get-the-entire-document-html-as-a-string) — wesinat0r, Oct 14 '19 at 00:24

score 48 · Accepted Answer · edited Jul 21 '15 at 14:57


Use

document.documentElement.outerHTML

or

document.documentElement.innerHTML
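Note that outerHTML includes the html tag itself, while innerHTML returns only what is inside it, and neither includes the doctype. A minimal sketch of putting the doctype back is shown here. The helper names `doctypeToString` and `getDocumentHtml` are made up for illustration and are not part of any API; the sketch only reads the standard `document.doctype` and `document.documentElement.outerHTML` properties. Like the properties above, it serializes the current DOM, which may differ from the markup the server sent.

```javascript
// Rebuild the <!DOCTYPE ...> declaration from a DocumentType node.
// Returns an empty string when the document has no doctype.
function doctypeToString(doctype) {
  if (!doctype) return "";
  let result = "<!DOCTYPE " + doctype.name;
  if (doctype.publicId) {
    result += ' PUBLIC "' + doctype.publicId + '"';
  } else if (doctype.systemId) {
    result += " SYSTEM";
  }
  if (doctype.systemId) {
    result += ' "' + doctype.systemId + '"';
  }
  return result + ">";
}

// Hypothetical helper: the doctype (if any) followed by the serialized
// <html> element. `doc` is expected to be a Document such as `document`.
function getDocumentHtml(doc) {
  const doctype = doctypeToString(doc.doctype);
  const html = doc.documentElement.outerHTML;
  return doctype ? doctype + "\n" + html : html;
}

// In a browser:
// console.log(getDocumentHtml(document));
```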

edited Jul 21 '15 at 14:57 by gunr2171

answered Sep 02 '09 at 13:07 by Eldar Djafarov

I don't know why, in Firefox, the document.documentElement object doesn't have the outerHTML property, but with innerHTML I can get almost everything except the doctype, so thank you! – mck89 Sep 02 '09 at 13:14
8

@mck89: no browser but IE will have `outerHTML`. – Crescent Fresh Sep 02 '09 at 13:21
6

Be aware that the source you get with Firefox/most browsers is the "true" source you served up. In IE you will get the "live" HTML of the page including any changes the user has made to forms, any new DOM content etc. In IE it will also be the mixed case invalid tag soup that IE provides when requesting the .innerHTML of elements. – scunliffe Sep 02 '09 at 13:35
2

In case anyone else is still looking into this, the situation has changed somewhat. @Crescent Fresh was correct 2 years ago; however, more recent versions of Chrome and Safari also implement HTMLElement.outerHTML, though at the time of writing Firefox does not. – Liam Newmarch Aug 19 '11 at 10:32
3

@LiamNewmarch 2 years after your comment, which was 2 years after the initial post, and it seems that now Firefox also implements outerHTML. :) – Kip Aug 12 '13 at 14:50
12

This is the current state of the DOM, not the source code. – Lothar May 10 '15 at 08:37

score 19 · Answer 2 · answered Jul 03 '13 at 14:40


This can be done in a one-liner using XMLSerializer.

var generatedSource = new XMLSerializer().serializeToString(document);

This gives a string such as:

<!DOCTYPE html><html><head>

<title>html - javascript page source code - Stack Overflow</title>
...

answered Jul 03 '13 at 14:40 by Paul S.

Unfortunately you will get garbage if the document content has any character that requires escaping in XML. Also you will not get the real original string but something slightly different (e.g. including an XML schema link). – 6502 Mar 07 '21 at 07:27

score 11 · Answer 3 · answered Sep 02 '09 at 13:08


One way to do this would be to re-request the page using XMLHttpRequest; you'll then get the entire page verbatim from the web server.
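As a rough sketch of that idea, the function below wraps XMLHttpRequest in a Promise. The name `requestPageSource` and the `XhrCtor` parameter are assumptions made for this example (the parameter exists only so the function can be tried outside a browser); the XMLHttpRequest calls themselves (`open`, `send`, `onload`, `onerror`, `status`, `responseText`) are standard. The server may return something different from what it sent on the original page load.

```javascript
// Re-request a URL and resolve with the response body as text.
// XhrCtor defaults to the browser's XMLHttpRequest; it is a parameter
// only for illustration and testing, not part of any real API.
function requestPageSource(url, XhrCtor = XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XhrCtor();
    xhr.open("GET", url);
    xhr.onload = () => {
      // Treat any 2xx status as success.
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(xhr.responseText);
      } else {
        reject(new Error("Request failed with status " + xhr.status));
      }
    };
    xhr.onerror = () => reject(new Error("Network error"));
    xhr.send();
  });
}

// In a browser:
// requestPageSource(window.location.href).then(source => console.log(source));
```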

answered Sep 02 '09 at 13:08 by Paul Dixon

Note that servers do not necessarily respond in exactly the same way to two individual requests. – mindplay.dk Sep 01 '21 at 11:12

score 2 · Answer 4 · edited Oct 27 '22 at 10:42


For IE you can also use:

document.all[0].outerHTML

edited Oct 27 '22 at 10:42 by L8R

answered Sep 02 '09 at 13:23 by DmitryK

Surprised this isn't marked as the answer. This works perfectly! The only thing is it only gets static HTML (doesn't retrieve anything javascript-related). – L8R Oct 26 '22 at 21:34

score 2 · Answer 5 · edited Apr 24 '18 at 16:06

Provided that

- the true HTML source code is wanted (not the current DOM serialization), and
- the page was loaded using the GET method,

the page source can be re-downloaded:

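fetch(document.location.href)
    // Read the response body as text
    .then(response => response.text())
    // pageSource holds the HTML as served for this second request
    .then(pageSource => { /* ... */ })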
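A slightly more defensive variant might check the response status and send an `Accept` header resembling a browser's. This is only a sketch: the name `fetchPageSource`, the `fetchImpl` parameter (there so the function can be tried without a network), and the exact header value are assumptions for illustration. Even so, the server is free to answer the second request differently from the first.

```javascript
// Re-download a page and resolve with its source text.
// fetchImpl defaults to the global fetch; it is a parameter only so the
// sketch can be exercised with a stub (an assumption of this example).
async function fetchPageSource(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    // Assumed value: roughly what a browser asks for on navigation.
    headers: { Accept: "text/html,application/xhtml+xml" },
  });
  // response.ok is true for 2xx statuses only.
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return response.text();
}

// In a browser:
// fetchPageSource(document.location.href).then(source => console.log(source));
```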

edited Apr 24 '18 at 16:06

answered Jun 24 '17 at 23:15 by czerny

1

That is unreliable because there is no guarantee that the server will serve the same content next time. – Szczepan Hołyszewski Sep 23 '17 at 02:43
@SzczepanHołyszewski Given that the REST protocol is defined as [stateless](https://stackoverflow.com/q/34130036/9063935), as long as you send the same headers in the ajax request as the browser did, then I would be confident the server would send the same response. – dwb Sep 19 '20 at 20:42
1

@dantechguy What are you talking about? There is nothing in the OP about REST. Whether an endpoint is a REST one depends on the server. The `fetch` API is typically used by client-side JS to talk to REST endpoints, but using the `fetch` API on a non-REST endpoint doesn't magically turn it into a REST one. But even if we talk REST, statelessness is irrelevant. Two identical REST GET requests can return different data if the resource was actually modified between the requests, or your permission to access the resource was revoked, or for a number of other reasons. – Szczepan Hołyszewski Sep 23 '20 at 13:24
You can make this a bit more reliable by at least adding an `Accept` header similar to that of the browser. But yeah, this approach is not generally reliable. – mindplay.dk Sep 01 '21 at 11:14
This worked for me! This YouTube URL has timedtext (transcription) in 'view page source', and I could only retrieve it by fetching the URL again. https://www.youtube.com/watch?v=LA-LMRFhzaw&ab_channel=jordifieke – Wim den Herder May 28 '22 at 19:05
