I'm trying to capture certain parts of HTML using regular expressions, and I've come across a situation which I don't know how to resolve.

I've got an HTML fragment like this:

<span ...> .... <span ...> ... </span> ... </span>

That is, a <span> element with another <span> element nested inside it.

I've been successfully using the following regex (in PHP's preg_match() / preg_match_all()) to capture entire HTML elements:

@<sometag[^>]+>.*?</sometag>@

This would capture a given starting tag and everything up to the closing tag of the same type.

However, in the situation above, this would capture the starting <span> and everything up to the next closing </span> encountered, so what I get is this:

<span ...> .... <span ...> ... </span>

that is, the outer starting tag, then everything until the starting tag of the inner span, then everything up to the closing tag of the inner span, which, of course, is not what I want.
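
To make the behaviour concrete, here is a minimal reproduction. The sample fragment and its class names are made up for illustration; they are not the real FrontPage output.

```php
<?php
// Hypothetical sample fragment: one span nested inside another.
$html = '<span class="outer">a <span class="inner">b</span> c</span>';

// The lazy pattern from above, with "span" substituted for "sometag".
preg_match('@<span[^>]+>.*?</span>@', $html, $m);

// The match stops at the FIRST closing tag, which belongs to the inner span.
echo $m[0], "\n";
// prints: <span class="outer">a <span class="inner">b</span>
```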

What I really want is the outer <span> element complete with everything inside it, including the nested inner <span>.

Is there any practical way to achieve this?

Note: parsing the HTML using an XML parser is probably not an option, as the HTML I'm working on is old and very broken HTML 4 coming out of MS FrontPage that any parser would choke on.

Thanks for any help!

Niels Heidenreich
  • Oops, when I said "capture", what I meant was "match". – Niels Heidenreich Aug 11 '10 at 09:46
  • You can [edit the post](http://stackoverflow.com/posts/3457072/edit) (the button is towards the bottom left of the question). I guess @Tom was referring to the fact that you're trying to parse HTML/XML with regex. Have you read [the most upvoted question on SO](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? – Amarghosh Aug 11 '10 at 10:10
  • All right, I think I learned something today. Thanks! – Niels Heidenreich Aug 11 '10 at 10:50

1 Answer

Obviously, the "right" answer is to use a DOM parser instead of regex, but you say your markup is too broken for a parser.

Before resorting to a regex, though, check whether simpleHTMLDOM can make sense of it. It is a bit more lenient towards broken markup than the PHP DOM-based parsers.
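
If a regex really does turn out to be the last resort, PCRE's recursion construct `(?R)` can keep opening and closing tags balanced. This is a minimal sketch, not a robust solution. The helper name `match_outer_elements` and the sample markup are made up for illustration. The sketch assumes the tags of interest are at least balanced and that no attribute value contains `>`, which badly broken markup may well violate.

```php
<?php
// Hypothetical helper: returns every outermost <$tag> element,
// including any elements of the same tag nested inside it.
function match_outer_elements(string $html, string $tag): array
{
    $t = preg_quote($tag, '@');

    // <tag ...>  then any mix of
    //   - characters that do not start an opening or closing <tag>, or
    //   - a complete nested <tag>...</tag> (recursion into the whole pattern),
    // then the matching </tag>.
    // Flags: i = case-insensitive tag names, s = "." also matches newlines.
    $pattern = '@<' . $t . '\b[^>]*>'
             . '(?:(?:(?!</?' . $t . '\b).)++|(?R))*'
             . '</' . $t . '\s*>@is';

    preg_match_all($pattern, $html, $matches);

    // If matching fails (for example, a backtrack limit is hit), return no matches.
    return $matches[0] ?? [];
}

// Made-up sample: mixed-case tags, one span nested in another.
$html = 'x <SPAN class="outer">a <span class="inner">b</span> c</SPAN> y';
print_r(match_outer_elements($html, 'span'));
```

For this sample the result holds a single entry, the complete outer element from `<SPAN class="outer">` to its own `</SPAN>`. If the real markup has unclosed or stray tags, the pattern may fail to match or may match something unexpected, so check the results on the real files.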

Pekka