How, using regex, can I capture the outer HTML element, when the same element type is nested within it?

Question

I'm trying to capture certain parts of HTML using regular expressions, and I've come across a situation which I don't know how to resolve.

I've got an HTML fragment like this:

<span ...> .... <span ...> ... </span> ... </span>

so, a <span> element into which another <span> element is nested.

I've been successfully using the following regex (in PHP's preg_match() / preg_match_all()) to capture entire HTML elements:

@<sometag[^>]+>.*?</sometag>@

This would capture a given starting tag and everything up to the closing tag of the same type.

However, in the situation above, this would capture the starting <span> and everything up to the next closing </span> encountered, so what I get is this:

<span ...> .... <span ...> ... </span>

that is, the outer starting tag, then everything until the starting tag of the inner span, then everything up to the closing tag of the inner span, which, of course, is not what I want.
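
The behaviour described above can be reproduced with a few lines of PHP. This is only a sketch: the sample markup and class names are made up for illustration, and "span" stands in for "sometag".

```php
<?php
// Hypothetical sample input, not taken from the real FrontPage markup.
$html = '<span class="a"> x <span class="b"> y </span> z </span>';

// The lazy quantifier .*? stops at the first closing tag it can reach,
// which is the closing tag of the inner element.
preg_match('@<span[^>]+>.*?</span>@', $html, $m);

echo $m[0], PHP_EOL;
```

The match ends at the inner closing tag, so the trailing " z </span>" of the outer element is left out.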

What I really want is the outer <span> element complete with everything that is inside it, including the inner nested <span>.

Is there any practical way to achieve this?

Note: parsing the HTML using an XML parser is probably not an option, as the HTML I'm working on is old and very broken HTML 4 coming out of MS FrontPage that any parser would choke on.

Thanks for any help!

You can [edit the post](http://stackoverflow.com/posts/3457072/edit) (the button is towards bottom left side of the question). I guess @Tom was referring to the fact that you're trying to parse html/xml with regex. Have you read [the most upvoted question on SO](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? — Amarghosh, Aug 11 '10 at 10:10

score 3 · Accepted Answer · answered Aug 11 '10 at 09:40

Obviously, the "right" answer is to use a DOM parser instead of regex, but you say your markup is too broken for a parser.

Before resorting to a regex, though, check whether simpleHTMLDOM can make sense of it. It is a bit more lenient towards broken markup than the PHP DOM-based parsers.
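
If no parser copes and a regex is still needed, PCRE's recursive patterns are one possible route. The following is a minimal sketch, not a robust solution: it assumes the opening and closing tags of the element are balanced, which very broken markup may not guarantee. The helper name outerElements() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper (not from the original post): returns every outermost
// <$tag>...</$tag> element, assuming the tags are properly balanced.
function outerElements(string $html, string $tag): array
{
    $t = preg_quote($tag, '@');
    $pattern = '@<' . $t . '\b[^>]*>'        // opening tag, attributes optional
             . '(?:'
             . '(?:(?!</?' . $t . '\b).)++'  // text that starts no <tag or </tag
             . '|(?R)'                       // or a complete nested element
             . ')*'
             . '</' . $t . '\s*>@is';        // the matching closing tag
    preg_match_all($pattern, $html, $m);
    return $m[0];
}

// Made-up sample input for illustration.
$html  = '<p>a</p><span class="o"> x <span class="i"> y </span> z </span><p>b</p>';
$found = outerElements($html, 'span');

echo $found[0], PHP_EOL;
```

(?R) re-enters the whole pattern, so each nested element is consumed as a unit before the outer closing tag is matched. The i and s modifiers are there because old markup often uses upper-case tag names and spans several lines. If a closing tag is missing, the outer element will not match at all, although an inner, balanced one may still be returned.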

Pekka

thanks for the library. i'm looking forward to trying it out! – mraaroncruz Aug 11 '10 at 09:46