How can I get the body contents out of a variable containing HTML?

Question

I have a variable htmlSource containing HTML code like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<title>IIS 8.0 Detailed Error - 404.0 - Not Found</title> 


</head> 
<body>xxx some code here yy</body> 
</html>

How can I create a new variable htmlBodyOnly that contains only "xxx some code here yy"? If possible, I would like to do this with a regular expression, but I am not sure how to exclude the opening and closing tags with a regex or something similar.

Sorry, but I don't have jQuery available. I am working only with a JavaScript string variable, not with the DOM.

Do you mean that you want to get the content between the `<body>` tags? — KarelG, Jun 19 '14 at 18:31
You're sure you want to use a regular expression instead of an HTML parser? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — tvanfosson, Jun 19 '14 at 18:32
What is known about the string? Are any parts of it reliably consistent? Can there be invalid HTML? — cookie monster, Jun 19 '14 at 18:32
A cheap solution is to do `var x = document.createElement("div"); x.innerHTML = str; console.log(x);` and remove the tags left. Better solution is [DOMParser](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). — epascarello, Jun 19 '14 at 18:35
...well anyway, if your `<body>` is always a plain tag like that, then don't bother with a regex. Just do `var str = html.split("<body>")[1]` and then slice away the trailing tags. Or use a simple regex that just targets the `body` and potential attributes. I don't think there's any need to parse the whole document if the string is fairly reliable. — cookie monster, Jun 19 '14 at 18:37
@Samantha_J There are no variables in the code provided. Please explain what you want. — Gavin42, Jun 19 '14 at 18:37
The string is an ASP.NET response web page of HTML. I just need to get the body out of it. Hope it's valid HTML as it's 100% created by Microsoft. — Samantha J T Star, Jun 19 '14 at 18:37
@cookiemonster - Your suggestion looks good, but how could I do that when I have a `<body>` and a `</body>`? — Samantha J T Star, Jun 19 '14 at 18:39
Can't it/you just send the relevant data? I don't use ASP.NET, but it would seem a shame to send an entire document if you only need a piece of it. — cookie monster, Jun 19 '14 at 18:40
...on the resulting string, you could do `.slice(0, str.lastIndexOf("</body>"))`, though again, this assumes consistent markup is being sent. — cookie monster, Jun 19 '14 at 18:41

Shmoopy · Accepted Answer · 2014-06-19T18:46:19.460

2

This is ugly, but you can keep it as a string with this method:

var htmlBodyOnly = htmlSource.substring(htmlSource.indexOf("<body>") + 6, htmlSource.indexOf("</body>"));

The +6 is there because the string "<body>" has 6 characters, and indexOf returns the index of the first character of the match, so adding 6 skips past the opening tag itself.

Here's proof that it works given your example: http://jsfiddle.net/9wBkf/

This assumes that the body tag has no attributes; it would fail for something like <body class="myClass">.
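If the body tag might carry attributes, the opening-tag regex and the "last closing tag" idea suggested in the comments can be combined into one pattern. The following is only a minimal sketch under the assumption that the markup is reasonably consistent; `BODY_PATTERN` and `extractBody` are hypothetical names chosen for illustration, and the pattern would still break on an attribute value that contains a literal ">".

```javascript
// Hypothetical pattern: an opening <body ...> tag (any attributes, any case),
// then a greedy capture that runs up to the LAST </body> in the string.
const BODY_PATTERN = /<body\b[^>]*>([\s\S]*)<\/body\s*>/i;

// Hypothetical helper: returns the text between the body tags,
// or null when no matching pair of tags is found.
function extractBody(htmlSource) {
  const match = BODY_PATTERN.exec(htmlSource);
  return match ? match[1] : null;
}
```

Usage would look like `var htmlBodyOnly = extractBody(htmlSource);`, checking for `null` before using the result.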

edited Jun 19 '14 at 18:46

answered Jun 19 '14 at 18:40

Shmoopy



What if the body has some attributes? The best way to parse something is to build an abstract syntax tree; `DOMParser` does this for you. – 0xcaff Jun 19 '14 at 18:43
If the markup format is reliable, I think this is the way to go, though I'd probably use `.lastIndexOf()` for the `</body>`. Better to avoid a full parse if it isn't needed. – cookie monster Jun 19 '14 at 18:45
@caffinatedmonkey Good point... if that's the case then regex would be the best solution, though I am not familiar enough with regex to offer help in that case... – Shmoopy Jun 19 '14 at 18:45
We could play "what if" games all day *(which could be used against DOM parsing too)*. A regex for the open tag could be `/<body[^>]*>/i` – cookie monster Jun 19 '14 at 18:50

score 1 · Answer 2 · edited May 23 '17 at 10:25


You can use a DOMParser to parse the HTML and extract the content of the body. See this SO question: Converting HTML string into DOM elements?

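// Parse the string as a complete HTML document (browser API).
var parser = new DOMParser();
var doc = parser.parseFromString(htmlSource, "text/html");
// The inner HTML of the parsed <body> element is the body content.
console.log(doc.body.innerHTML);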

Here is a Fiddle!
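The snippet above can be wrapped in a small function that produces the htmlBodyOnly value the question asks for. This is a sketch rather than a definitive implementation: `getBodyHtml` is a hypothetical name, the `Parser` parameter is an assumption added so the function can be exercised where no browser `DOMParser` exists, and the `trim()` call is there because whitespace around the closing tags may end up inside the parsed body.

```javascript
// Hypothetical helper; `Parser` defaults to the browser's DOMParser but
// can be swapped for any constructor exposing parseFromString().
function getBodyHtml(htmlSource, Parser = globalThis.DOMParser) {
  if (typeof Parser !== "function") {
    throw new Error("No DOMParser available in this environment");
  }
  // "text/html" selects the HTML parser rather than the XML one.
  const doc = new Parser().parseFromString(htmlSource, "text/html");
  // Trim stray whitespace; fall back to an empty string if there is no body.
  return doc.body ? doc.body.innerHTML.trim() : "";
}
```

In a browser, `var htmlBodyOnly = getBodyHtml(htmlSource);` would be the whole call. Unlike the string-based approach, this does not depend on the exact shape of the opening body tag.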

edited May 23 '17 at 10:25

Community


answered Jun 19 '14 at 18:35

0xcaff


score 0 · Answer 3 · edited May 23 '17 at 12:27


I don't know which regular expression would work for that, but there is an alternative: you can convert your string to a DOM object and then read its body child.

Converting HTML string into DOM elements?

edited May 23 '17 at 12:27

Community


answered Jun 19 '14 at 18:34

dejakob

