I'm trying to match any bracketed items within <sup> tags.

My regular expression is too greedy: the match starts at the first <sup> tag and ends at the last </sup> tag.

/<sup\b[^>]*>(.*?)\[(.*?)\](.*?)<\/sup>/

Example html:

<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>

Any idea why?

Thanks!

EDIT: I suppose these answers make sense. I've got much of the HTML parsed without regex; I just figured that this particular example would work with regex because it would do the following:

  1. see the first <sup> tag
  2. find the first instance of </sup>
  3. search the inside for (wild)(bracket)(wild)(closing bracket)(wild)
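
A regex engine does not work through those three steps in order, though. A lazy `.*?` only prefers the shortest match; nothing stops it from running past a `</sup>` when that is the only way to reach a `[`. The sketch below uses Python's `re` module purely for illustration (the question does not show its host code) and assumes the dot is allowed to match newlines, which is equivalent to the markup sitting on one line. It reproduces the over-match on the example HTML, then tries a "tempered" variant whose wildcards refuse to cross `</sup>`. The names `PATTERN`, `STOP` and `TEMPERED` are made up for this example.

```python
import re

HTML = """<sup>[this should be gone]</sup>
<sup>but this should stay</sup>
<sup>this should [ also stay</sup>
[and this as well]
<sup><a href="#">[but this should definitely go]</a></sup>"""

# The pattern from the question, without the surrounding / delimiters.
PATTERN = r"<sup\b[^>]*>(.*?)\[(.*?)\](.*?)</sup>"

# re.S lets "." match newlines, so a lazy wildcard can walk across lines
# (and across </sup>) until it finds the bracket it needs.
greedy_spans = [m.group(0) for m in re.finditer(PATTERN, HTML, re.S)]

# Tempered wildcard: consume one character at a time, but never start
# consuming at a position where "</sup>" begins.
STOP = r"(?:(?!</sup>).)*?"
TEMPERED = rf"<sup\b[^>]*>{STOP}\[{STOP}\]{STOP}</sup>"
tempered_spans = [m.group(0) for m in re.finditer(TEMPERED, HTML, re.S)]

# The original pattern yields two matches; the second one runs from the
# second <sup> all the way to the last </sup>.
print(len(greedy_spans), repr(greedy_spans[1][:25]))
# The tempered pattern matches only the first and the last <sup> element.
print(tempered_spans)
```

The tempered version still assumes that <sup> elements are never nested inside each other and that no attribute value contains a `>`, so it is a patch for this specific input rather than a general HTML solution.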
Peter
  • Which cases is it working on? Which is it failing on? "Too greedy" isn't quite enough information :) – Cameron Skinner Jan 04 '11 at 22:28
  • "starting with the first `<sup>` tag and ending at the last `</sup>` tag" Meaning that it would take the whole document in this case (untested on this particular example) – Peter Jan 04 '11 at 22:33
  • I'd suggest using a DOM parser to parse XML as regular expressions are not well suited to the task. – webbiedave Jan 04 '11 at 22:33

5 Answers

You really can't do this. It's impossible to parse HTML with regular expressions, because regular expressions (in the formal sense) can only match regular languages, which are a simpler subset of the languages we actually use. One very common non-regular language is the Dyck language of balanced brackets: it's impossible to match correctly nested parentheses with regular expressions. And HTML, if you think about it, is the same thing, with tags replacing parentheses. Thus, matching (a) correctly nested sup tags and (b) balanced square brackets are both impossible.

I don't use PHP myself, but I know it has access to an HTML DOM; I'd recommend using that instead. Walk the DOM for every sup tag and check each one's inner text. If you only want to catch tags whose inner text is just [...], where the ... does not contain square brackets, you can use ^\[[^\]]+\]$ as your regex; if you want real nesting, more complicated checking is necessary.
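
As a rough illustration of the DOM approach, here is a minimal sketch in Python rather than PHP, using the standard library's `xml.etree.ElementTree` as a stand-in for a real HTML DOM. It assumes the fragment is well-formed XML (every tag closed, no HTML-only entities such as `&nbsp;`); real-world HTML would need a proper HTML parser. The function name `remove_bracketed_sups` and the throwaway `<root>` wrapper are inventions for this example.

```python
import re
import xml.etree.ElementTree as ET

# The anchored regex from the answer: the inner text is exactly "[...]"
# and the "..." contains no closing square bracket.
BRACKETED = re.compile(r"^\[[^\]]+\]$")


def remove_bracketed_sups(fragment: str) -> str:
    # Wrap the fragment in a throwaway <root> so it parses as one document.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "sup":
                continue
            # itertext() gathers text from nested tags such as <a> as well.
            inner_text = "".join(child.itertext()).strip()
            if not BRACKETED.match(inner_text):
                continue
            # ElementTree stores the text that follows an element on the
            # element itself (its "tail"), so hand that text to a neighbour
            # before removing the element.
            siblings = list(parent)
            position = siblings.index(child)
            if position == 0:
                parent.text = (parent.text or "") + (child.tail or "")
            else:
                previous = siblings[position - 1]
                previous.tail = (previous.tail or "") + (child.tail or "")
            parent.remove(child)
    # Serialise the children only, dropping the throwaway wrapper.
    return (root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root
    )
```

Because the check runs on each element's text content, the `<a>` wrapper in the question's last example needs no special handling.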

Antal Spector-Zabusky
  • LOL. That's a fantastic link. – MattB Jan 04 '11 at 22:42
  • :P I've seen the link. I didn't really consider the query to be serious xhtml parsing (as I'm just matching a single tag), but perhaps the nested tags do indeed qualify it as the wrong approach. – Peter Jan 04 '11 at 22:53

If your requirement were only to remove text between "<sup>[" and "]</sup>", you would be OK. But judging by your last example, you want to account for a nested tag as well, and probably arbitrarily nested tags. So I must remind you...

Don't parse HTML with regex!
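
For the narrow case described in the first sentence, where the bracket sits directly against the tags, a literal pattern is enough. This is a small Python sketch under that assumption; `STRICT` and `remove_simple_bracketed_sups` are names invented here, and the pattern deliberately ignores attributes and nested tags.

```python
import re

# Matches only "<sup>[" ... "]</sup>" with no tags or "]" in between.
STRICT = re.compile(r"<sup>\[[^\]]*\]</sup>")


def remove_simple_bracketed_sups(html: str) -> str:
    # Delete every <sup> element that is nothing but one bracketed item.
    return STRICT.sub("", html)
```

On the question's example this removes the first <sup> but leaves the one wrapping an <a> tag untouched, which is exactly the limitation the answer points out.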

Tesserex

Isn't that the normal behavior? Have you specified the ungreedy option for your regexp?

greg0ire

You probably cannot do this with one regular expression. You will need one that replaces using a callback function, which will run a separate regular expression.
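
One way to read the callback idea, sketched in Python (PHP's `preg_replace_callback` plays the same role): an outer regex isolates each <sup>...</sup> span, and a callback runs a second regex on the inside to decide whether to drop it. This assumes <sup> elements are never nested inside each other; the names `SUP`, `BRACKETS` and `drop_bracketed_sups` are hypothetical.

```python
import re

# Outer pattern: one <sup> element, stopping at the first </sup>.
SUP = re.compile(r"<sup\b[^>]*>(.*?)</sup>", re.S)
# Inner pattern: a complete [ ... ] pair somewhere in the content.
BRACKETS = re.compile(r"\[[^\]]*\]")


def drop_bracketed_sups(html: str) -> str:
    def decide(match: re.Match) -> str:
        inner = match.group(1)
        # Remove the whole element if its content holds a bracketed item;
        # otherwise put the original text back unchanged.
        return "" if BRACKETS.search(inner) else match.group(0)

    return SUP.sub(decide, html)
```

Splitting the work this way mirrors the steps listed in the question: the outer match is bounded by the first </sup>, so the bracket search can never leak into a neighbouring element.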

The better method, as everyone has mentioned, would be to use a DOM object to parse the HTML first.

dqhendricks

Using regexp to parse HTML is usually not a very good idea.

See Parsing Html The Cthulhu Way

bw_üezi