I have the outerHTML of the html tag in a string and want to extract the innerHTML of the body tag. The function runs in C#, so I don't have access to any HTML/JavaScript DOM functionality. This is similar to "How do i grab everything inside the BODY html tag (From a string) using RegEx Asp.net C#".

The HTML Agility route won't work because of the differences in the HTML document that occur during the LoadHtml conversion. I capture the differences between the original HTML body and the HTML body as it updates on a live site. I want those differences to be compared to the original body innerHTML. The reason I want to extract the body innerHTML from the html outerHTML is to save on data transfer (one transmission of html, head and body together, instead of one transmission of each).

Ideally this would handle any edge case, such as attributes in the body tag, invalid HTML in the body tag, etc.

THM

2 Answers

0

The HTML Agility route won't work because of the differences in the HTML document that occur during the LoadHtml conversion

So load both the original and the new version with the same process and then compare them.

You lose non-infoset details like tag case, quoting and attribute order. But you already lost that anyway, since innerHTML (or outerHTML) is regenerated by the browser from the DOM infoset when you read the property; it is explicitly not the original HTML you put in.

bobince
  • I don't capture the innerHTML of the body when it changes, I capture the diff between the original and the changed version and send just the diff. Thank you though. – THM Jun 28 '12 at 23:53
  • I see the [medication must be working](http://stackoverflow.com/a/1732454/451969). – Jared Farrish Jun 28 '12 at 23:55
0

With

var matches = outerHTML.match(
  /<body(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/body>/i);

If the pattern matches, matches[1] will contain the content of the body element; if it does not, matches is null (this is an implementation of the parsing rules in the HTML5 WD).
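As a minimal sketch of how the match above might be wrapped for reuse: the helper name `extractBodyInnerHtml` is hypothetical, and the attribute part of the pattern is loosened here (an assumption, not part of the original expression) so that hyphenated names such as `data-theme` and value-less attributes such as `hidden` are also accepted. The pattern uses only constructs that .NET's `Regex` class also supports, so it should carry over to C# with `RegexOptions.IgnoreCase`, though that is untested here.

```javascript
// Hypothetical helper; the name and the loosened attribute pattern are assumptions.
// Attribute names: any run of characters except whitespace, =, >, /, and quotes.
// Attribute values: optional; double-quoted, single-quoted, or unquoted.
const BODY_RE =
  /<body(?:\s+[^\s=>\/"']+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?)*\s*>([\S\s]*)<\/body>/i;

function extractBodyInnerHtml(outerHTML) {
  const matches = outerHTML.match(BODY_RE);
  // match() returns null when there is no body element to find.
  return matches ? matches[1] : null;
}

const sample =
  '<html><head></head><body class="x" data-theme="dark" hidden><p>hi</p></body></html>';
console.log(extractBodyInnerHtml(sample));            // <p>hi</p>
console.log(extractBodyInnerHtml('<p>no body</p>'));  // null
```

Returning null rather than throwing leaves it to the caller to decide what a missing body element means.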

The body element is a special case because there can be only one in an HTML document, so it does not matter that the regular expression is greedy. In general, you had better use a markup parser instead.
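To illustrate the point about greediness, here is a small sketch with two deliberately simplified patterns (no attribute handling; both are illustrative assumptions, not the expression above). When the text `</body>` also appears inside the content, for example in a comment, the greedy capture runs to the last closing tag while a lazy one stops at the first.

```javascript
// Simplified patterns for illustration only.
const greedy = /<body>([\S\s]*)<\/body>/i;   // captures up to the LAST </body>
const lazy = /<body>([\S\s]*?)<\/body>/i;    // captures up to the FIRST </body>

const html = '<html><body><p>a</p><!-- </body> --><p>b</p></body></html>';

console.log(JSON.stringify(html.match(greedy)[1]));
// "<p>a</p><!-- </body> --><p>b</p>"
console.log(JSON.stringify(html.match(lazy)[1]));
// "<p>a</p><!-- "
```

The reverse caveat also applies: if `</body>` text occurs after the real closing tag, the greedy capture overshoots, which is one more reason to prefer a markup parser where one is available.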

PointedEars