Match pattern within pattern

Question

I'm trying to match any bracketed items within `<sup>` tags.

My regular expression is being too greedy: it starts at the first `<sup>` tag and ends at the last `</sup>` tag.

/<sup\b[^>]*>(.*?)\[(.*?)\](.*?)<\/sup>/

Example html:

<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>

Any idea why?

Thanks!

EDIT: I suppose these answers make sense. I've got much of the HTML parsed without regex; I just figured that this particular example would work with regex because it would do the following:

see the first `<sup>` tag
find the first instance of `</sup>`
search the inside for (wild)(bracket)(wild)(closing bracket)(wild)

Which cases is it working on? Which is it failing on? "Too greedy" isn't quite enough information :) — Cameron Skinner, Jan 04 '11 at 22:28
"starting with the first `<sup>` tag and ending at the last `</sup>` tag", meaning that it would take the whole document in this case (untested on this particular example) — Peter, Jan 04 '11 at 22:33
I'd suggest using a DOM parser to parse XML as regular expressions are not well suited to the task. — webbiedave, Jan 04 '11 at 22:33

score 2 · Accepted Answer · edited May 23 '17 at 12:26

You really can't do this. It's impossible to parse HTML with regular expressions, because regular expressions can only match regular languages, which are a simpler subset of the languages we actually use. One very common non-regular language is the Dyck language of balanced brackets: it's impossible to match correctly nested parentheses with regular expressions. HTML, if you think about it, is the same thing with tags in place of parentheses. Thus, (a) matching correctly nested sup tags is impossible, and (b) matching balanced square brackets is impossible.

I don't use PHP myself, but I know it has access to an HTML DOM; I'd recommend using that instead. Then go through every sup tag and check each one's inner text. If you only want to catch tags whose inner text is just [...], where the ... does not contain square brackets, you can use ^\[[^\]]+\]$ as your regex; if you want real nesting, more complicated checking is necessary.
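
As a rough illustration of the DOM approach described above, here is a minimal sketch. It is written in Python rather than PHP, and it uses the standard library's `xml.etree.ElementTree` as a stand-in for an HTML DOM, which assumes the fragment is well-formed XML; real-world HTML would need a proper HTML parser. The function name `remove_bracketed_sups_dom` and the `<root>` wrapper are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# The regex from the answer: inner text that is exactly one [...] group.
BRACKETED = re.compile(r"^\[[^\]]+\]$")


def remove_bracketed_sups_dom(fragment: str) -> str:
    """Drop every <sup> whose inner text is just [...] (hypothetical helper)."""
    # Assumption: the fragment is well-formed XML once wrapped in a root element.
    root = ET.fromstring(f"<root>{fragment}</root>")

    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "sup":
                continue
            inner_text = "".join(child.itertext()).strip()
            if not BRACKETED.match(inner_text):
                continue
            # ElementTree stores the text that follows an element on the
            # element itself (its "tail"), so move it before removing the node.
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + (child.tail or "")
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + (child.tail or "")
            parent.remove(child)

    serialized = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    # Strip the artificial <root>...</root> wrapper again.
    return serialized[len("<root>"):-len("</root>")]


example = """<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>"""

print(remove_bracketed_sups_dom(example))
```

Because the check runs on each element's inner text rather than on the raw markup, the nested `<a>` in the last example line does not need special handling.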

:P I've seen the link. I didn't really consider the query to be serious xhtml parsing (as I'm just matching a single tag), but perhaps the nested tags do indeed qualify it as the wrong approach. — Peter, Jan 04 '11 at 22:53

score 0 · Answer 2 · edited May 23 '17 at 11:48

If your requirement were only to remove any text between "[" and "]</sup>", you would be OK. But your last example shows that you also want to account for a nested tag, and probably arbitrarily nested tags. So I must remind you...

Don't parse html with regex!

answered Jan 04 '11 at 22:32 by Tesserex

score 0 · Answer 3 · answered Jan 04 '11 at 22:33

Isn't it the normal behavior? Have you specified the ungreedy option for your regexp?
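
For reference, the pattern in the question already uses lazy quantifiers (`.*?`), and it can still overshoot. The small sketch below (Python is used purely for illustration, and the sample string is made up) shows why: laziness only keeps each group as short as possible; it does not stop a group from running across a `</sup>` when that is the only way to reach a `[`.

```python
import re

# The pattern from the question, without the surrounding / delimiters.
pattern = re.compile(r"<sup\b[^>]*>(.*?)\[(.*?)\](.*?)</sup>")

# Hypothetical single-line input: the only bracketed text sits outside any <sup>.
text = "<sup>but this should stay</sup> [and this as well] <sup>keep</sup>"

m = pattern.search(text)

# The match starts at the first <sup>, crosses its closing tag to reach the
# stray "[...]", and only then stops at the next </sup>.
print(m.group(0) == text)  # True
print(m.group(2))          # and this as well
```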

answered Jan 04 '11 at 22:33 by greg0ire

score 0 · Answer 4 · answered Jan 04 '11 at 22:34

You probably cannot do this with a single regular expression. You will need one that replaces using a callback function, which in turn runs a separate regular expression.

The better method, as everyone has mentioned, would be to use a DOM object to parse the HTML first.
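
The callback idea above could look roughly like the following sketch. It is in Python for illustration (PHP has a comparable callback-based replace), it assumes `<sup>` tags are never nested inside each other, and the helper names (`drop_if_bracketed`, `remove_bracketed_sups_regex`) are invented here. The outer pattern isolates one `<sup>...</sup>` span at a time, and the callback runs the second pattern on that span only.

```python
import re

# Outer pattern: one <sup>...</sup> span at a time (assumes no nested <sup>).
SUP = re.compile(r"<sup\b[^>]*>(.*?)</sup>", re.DOTALL)
# Naive tag stripper, used only to get at the text inside the span.
TAG = re.compile(r"<[^>]+>")
# Inner pattern: the text is exactly one [...] group with no brackets inside.
BRACKETED = re.compile(r"^\[[^\]]+\]$")


def drop_if_bracketed(match: re.Match) -> str:
    """Callback: return '' to delete the span, or the span itself to keep it."""
    inner_text = TAG.sub("", match.group(1)).strip()
    return "" if BRACKETED.match(inner_text) else match.group(0)


def remove_bracketed_sups_regex(html: str) -> str:
    return SUP.sub(drop_if_bracketed, html)


sample = """<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>"""

print(remove_bracketed_sups_regex(sample))
```

This handles the five example lines from the question, but it inherits the usual limits of regex on markup: nested `<sup>` tags, comments, or a `>` inside an attribute value would confuse it.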

score 0 · Answer 5 · answered Jan 04 '11 at 22:44

Using regexp to parse HTML is usually not a very good idea.

See Parsing Html The Cthulhu Way

answered Jan 04 '11 at 22:44 by bw_üezi
