
I have a variable htmlSource containing HTML code like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<title>IIS 8.0 Detailed Error - 404.0 - Not Found</title> 


</head> 
<body>xxx some code here yy</body> 
</html>

How can I create a new variable htmlBodyOnly that contains only "xxx some code here yy"? If possible, I would like to do this with a regular expression, but I am not sure how to exclude the start and end tags using a regex or something similar.

Sorry, but I don't have jQuery available. I am working only with a JavaScript string variable, not with the DOM.

Samantha J T Star
  • Do you mean that you want to get the content between the `<body>` tags? – KarelG Jun 19 '14 at 18:31
  • You're sure you want to use a regular expression instead of an HTML parser? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – tvanfosson Jun 19 '14 at 18:32
  • What is known about the string? Are any parts of it reliably consistent? Can there be invalid HTML? – cookie monster Jun 19 '14 at 18:32
  • @tvanfosson: OP doesn't want to parse the entire document. – cookie monster Jun 19 '14 at 18:34
  • A cheap solution is to do `var x = document.createElement("div"); x.innerHTML = str; console.log(x);` and remove the tags left. Better solution is [DOMParser](https://developer.mozilla.org/en-US/docs/Web/API/DOMParser). – epascarello Jun 19 '14 at 18:35
  • ...well anyway, if your `<body>` is always a plain tag like that, then don't bother with a regex. Just do a `var str = html.split("<body>")[1]` and then slice away the trailing tags. Or use a simple regex that just targets the `body` and potential attributes. I don't think there's any need to parse the whole document if the string is fairly reliable. – cookie monster Jun 19 '14 at 18:37
  • @Samantha_J There are no variables in the code provided. Please explain what you want. – Gavin42 Jun 19 '14 at 18:37
  • The string is an ASP.NET response web page of HTML. I just need to get the body out of it. Hope it's valid HTML as it's 100% created by Microsoft. – Samantha J T Star Jun 19 '14 at 18:37
  • @cookiemonster - Your suggestions looks good but how could I do that as I have a and a . – Samantha J T Star Jun 19 '14 at 18:39
  • @SamanthaJ that's funny – tvanfosson Jun 19 '14 at 18:40
  • Can't it/you just send the relevant data? I don't use ASP.NET, but it would seem a shame to send an entire document if you only need a piece of it. – cookie monster Jun 19 '14 at 18:40
  • ...on the resulting string, you could do `.slice(0, str.lastIndexOf("</body>"))`, though again, this assumes consistent markup is being sent. – cookie monster Jun 19 '14 at 18:41

3 Answers


This is ugly, but you can keep it as a string with this method:

var htmlBodyOnly = htmlSource.substring(htmlSource.indexOf("<body>") + 6, htmlSource.indexOf("</body>"));

The +6 is there because the string "<body>" has 6 characters and indexOf returns the index where the match begins, so adding 6 skips past the opening tag.

Here's proof that it works given your example: http://jsfiddle.net/9wBkf/

This assumes that the body tag has no attributes. It will not work for a tag such as <body class="myClass">.
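If the opening tag might carry attributes, the same indexOf idea can be stretched a little. The following is only a minimal sketch, not a robust parser. The helper name extractBodyByIndex is hypothetical, and it assumes lowercase tag names, no ">" character inside attribute values, and that the last "</body>" in the string is the real closing tag.

```javascript
// Hypothetical helper (not from the original answer).
// Returns the text between the opening <body ...> tag and the last </body>,
// or null when either tag cannot be found.
function extractBodyByIndex(htmlSource) {
  var openStart = htmlSource.indexOf("<body");      // start of the opening tag
  if (openStart === -1) return null;

  var openEnd = htmlSource.indexOf(">", openStart); // end of the opening tag
  var closeStart = htmlSource.lastIndexOf("</body>");
  if (openEnd === -1 || closeStart === -1 || closeStart < openEnd) return null;

  return htmlSource.substring(openEnd + 1, closeStart);
}

// Assumed sample input, modelled on the question's markup.
var sampleWithAttributes =
  '<html><head><title>t</title></head>\n' +
  '<body class="myClass">xxx some code here yy</body>\n' +
  '</html>';

var htmlBodyOnly = extractBodyByIndex(sampleWithAttributes);
console.log(htmlBodyOnly);
```

Searching for "<body" and then the next ">" avoids hard-coding the +6 offset, at the cost of the assumptions listed above.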

Shmoopy
  • What if the body has some attributes? The best way to parse something is to build an abstract syntax tree; `DOMParser` does this for you. – 0xcaff Jun 19 '14 at 18:43
  • If the markup format is reliable, I think this is the way to go, though I'd probably use `.lastIndexOf()` for the `</body>`. Better to avoid a full parse if it isn't needed. – cookie monster Jun 19 '14 at 18:45
  • @caffinatedmonkey Good point... if that's the case then regex would be the best solution, though I am not familiar enough with regex to offer help in that case... – Shmoopy Jun 19 '14 at 18:45
  • We could play "what if" games all day *(which could be used against DOM parsing too)*. A regex for the open tag could be `/<body[^>]*>/i` – cookie monster Jun 19 '14 at 18:50
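
Since the question asks for a regular expression, a pattern along these lines may be enough when the markup is predictable. This is a sketch under assumptions, not a general HTML parser. The function name extractBodyByRegex is hypothetical, and the pattern assumes a single body element whose opening tag contains no ">" inside attribute values.

```javascript
// Hypothetical helper (not from the thread).
// Captures everything between the opening <body ...> tag and the last </body>.
function extractBodyByRegex(htmlSource) {
  // [^>]*   allows optional attributes on the opening tag
  // [\s\S]* matches any character including newlines; being greedy, it runs to the last </body>
  var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(htmlSource);
  return match ? match[1] : null;
}

// Assumed sample input, shortened from the question's markup.
var htmlSource = [
  '<html xmlns="http://www.w3.org/1999/xhtml">',
  '<head>',
  '<title>IIS 8.0 Detailed Error - 404.0 - Not Found</title>',
  '</head>',
  '<body>xxx some code here yy</body>',
  '</html>'
].join("\n");

var htmlBodyOnly = extractBodyByRegex(htmlSource);
console.log(htmlBodyOnly);
```

The capture group keeps the tags themselves out of the result, which is the "exclude the start and end" part of the question.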

You can use a DOMParser to parse the html and extract the content of the body. See this SO question: Converting HTML string into DOM elements?

var parser = new DOMParser()
var doc = parser.parseFromString(stringToParse, "text/html")
console.log(doc.body.innerHTML)

Here is a Fiddle!

0xcaff

I do not know which regular expression you could use for that, but there is an alternative: you can convert your string to a DOM object and then read its body child.

Converting HTML string into DOM elements?

dejakob