What regex pattern will extract the `<body>` innerHTML from the containing `<html>`'s outerHTML text?

Question

I have the outerHTML of the html tag in a string and want to extract the innerHTML of the body tag. The function runs in C#, so I don't have access to any HTML/JavaScript DOM functionality. This is similar to "How do i grab everything inside the BODY html tag (From a string) using RegEx Asp.net C#".

The HTML Agility route won't work because of the differences in the HTML document that occur during the LoadHtml conversion. I capture the differences between the original HTML body and the HTML body as it updates on a live site, and I want those differences to be compared against the original body innerHTML. The reason I want to extract the body innerHTML from the html outerHTML is to save on data transfer (one transmission of the html, head, and body together, instead of one transmission of each).

Ideally this would handle any edge case, such as attributes in the body tag, invalid HTML in the body tag, etc.

Obligatory [*screaming into insanity for the sake of sanity*](http://stackoverflow.com/a/1732454/451969) answer. You're welcome. `;)` — Jared Farrish, Jun 28 '12 at 23:36
Also, if you're just trying to get the markup between the `<body>` "tags", why not just do a substring on those two positions? — Jared Farrish, Jun 28 '12 at 23:44
Mostly because I didn't think this through enough. Thank you again. — THM, Jun 28 '12 at 23:51
You could probably use a SAX parser, if you can find one for .NET. — Qtax, Jun 29 '12 at 00:22
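
The substring idea from the comments can be sketched roughly as follows. This is only an illustration in JavaScript (the language of the code sample in this thread), not something posted by the commenters: the name `extractBodyBySubstring` is hypothetical, and the sketch assumes that the first `<body` in the string is the real opening tag and that no attribute value on that tag contains a `>`.

```javascript
// Hypothetical helper, not from the original thread.
// Assumptions: the first "<body" is the real opening tag, and no
// attribute value on that tag contains a '>' character.
function extractBodyBySubstring(outerHTML) {
  // Locate the start of the opening tag, ignoring case.
  const openStart = outerHTML.search(/<body[\s>]/i);
  if (openStart === -1) return null;

  // The opening tag ends at the first '>' after "<body".
  const openEnd = outerHTML.indexOf(">", openStart);
  if (openEnd === -1) return null;

  // Find the position of the LAST closing tag, ignoring case.
  const closeRe = /<\/body\s*>/gi;
  let closeStart = -1;
  let m;
  while ((m = closeRe.exec(outerHTML)) !== null) {
    closeStart = m.index;
  }
  if (closeStart === -1 || closeStart < openEnd) return null;

  // Everything between the two positions is the body's inner markup.
  return outerHTML.substring(openEnd + 1, closeStart);
}
```

Because the string is never parsed or re-serialized, the returned markup is byte-for-byte what was inside the body tags, which is what a diff against the original text needs.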

score 0 · Answer 1 · answered Jun 28 '12 at 23:51

> The HTML Agility route won't work because of the differences in the HTML document that occur during the LoadHtml conversion

So load both the original and the new version with the same process and then compare them.

You lose non-infoset details like tag case, quoting and attribute order. But you already lost that anyway, since innerHTML (or outerHTML) is regenerated by the browser from the DOM infoset when you read the property; it is explicitly not the original HTML you put in.

answered Jun 28 '12 at 23:51

bobince

I don't capture the innerHTML of the body when it changes; I capture the diff between the original and the changed version and send just the diff. Thank you though. – THM Jun 28 '12 at 23:53
I see the [medication must be working](http://stackoverflow.com/a/1732454/451969). – Jared Farrish Jun 28 '12 at 23:55

PointedEars · Answer 2 · 2012-06-29T00:50:26.453

With

var matches = outerHTML.match(
  /<body(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/body>/i);

matches[1] will contain the content of the body element (this is an implementation of the parsing rules in the HTML5 WD).

The body element is a special case because there can be only one in an HTML document, so it does not matter that the regular expression is greedy. In general, you had better use a markup parser instead.
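
For illustration, the pattern above could be wrapped in a small helper that also handles the no-match case, since `match` returns `null` when nothing is found. This is a minimal sketch: the regular expression is copied unchanged from the answer, while the names `BODY_PATTERN` and `extractBodyInnerHTML` are hypothetical.

```javascript
// The pattern is copied unchanged from the answer above.
// BODY_PATTERN and extractBodyInnerHTML are hypothetical names.
const BODY_PATTERN =
  /<body(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/body>/i;

function extractBodyInnerHTML(outerHTML) {
  const matches = outerHTML.match(BODY_PATTERN);
  // match() returns null when the pattern does not match at all.
  return matches === null ? null : matches[1];
}

// Usage: attributes on the body tag are skipped, the content is captured.
const sample = '<html><head></head><body class="x" id=\'y\'><div>a</div></body></html>';
console.log(extractBodyInnerHTML(sample)); // prints "<div>a</div>"
```

As written, the pattern only accepts attribute names made of letters and requires every attribute to have a value, so a tag such as `<body data-x="1">` or `<body hidden>` does not match and the helper returns `null`. The same syntax should also be accepted by .NET's `Regex` class with `RegexOptions.IgnoreCase`, though that has not been checked here.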
