2

I want to match a closing HTML tag like </EM> only if there is no opening <EM> tag for it somewhere before it, i.e. among any of the preceding tags or text. My sample string is:

ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>

In this string the match should be </EM>, and specifically the second </EM>, because it lacks an opening <EM>. I have tried:

(?!=<EM>.*)</EM>

but it doesn't seem to work. Please help, thanks.

shabby
  • The regex I use matches both `</EM>` tags, whereas I only want the second one to be matched – shabby Dec 02 '08 at 07:33
  • You can always edit your own post to include additions and clarifications. They are more likely to be read this way. – Tomalak Dec 02 '08 at 08:07

3 Answers

3

I am not sure regex is best suited for this kind of task, since tags can always be nested.

Anyhow, a C# regex like:

(?<!<EM>[^<]+)</EM>

would match only the second </EM> tag.

Note that:

  • ?! is a negative lookahead, which explains why both </EM> tags are found.
    (?!=<EM>.*)xxx actually means: match xxx only if the text starting at that same position does not match =<EM>.*. Since </EM> never starts with =, the lookahead always succeeds. I am not sure you wanted to include an = in there.
  • ?<! is a negative lookbehind, better suited to what you want, but it would not work with the Java regex engine, since this lookbehind does not have an obvious maximum length.

However, with the .NET regex engine, as tested on RETester, it does work.
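The lookahead behaviour described above is easy to reproduce. What follows is a minimal sketch using Python's re module as a stand-in (an assumption; the answer itself targets .NET). Python's re, like the Java engine mentioned above, rejects lookbehinds that do not have a fixed width, so the second half only shows that the pattern is refused there. The variable names are illustrative.

```python
import re

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"

# Pattern from the question. The negative lookahead sits in front of "</EM>",
# so it only asserts that the text at that position does not start with "=<EM>".
# "</EM>" starts with "<", so the assertion holds for every closing tag.
lookahead = re.compile(r"(?!=<EM>.*)</EM>")
lookahead_hits = [m.start() for m in lookahead.finditer(sample)]
print(lookahead_hits)  # offsets of both </EM> tags

# Pattern from this answer. Python's re only accepts fixed-width lookbehinds,
# so compiling it fails here; .NET accepts variable-length lookbehinds.
try:
    re.compile(r"(?<!<EM>[^<]+)</EM>")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False
print(lookbehind_supported)
```

In an engine without variable-length lookbehind, the same check has to be done in code rather than inside the pattern.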

VonC
  • I tried this and it isn't working; it returns exactly the same matches as mine. Thanks anyway – shabby Dec 02 '08 at 08:18
  • It fails if an EM tag has a child tag, like this: ddd ddfdsdsd dsdsdddssrdfddfs n dand strong – shabby Nov 21 '13 at 19:13
  • @shabby I agree. At the time (almost 5 years ago) I was merely testing the regexp based on your example in your question. It won't match every case, but since October 2011, we all know that ;) http://stackoverflow.com/a/1732454/6309 – VonC Nov 21 '13 at 19:19
0

You should see the top answer to this other Stack Overflow question, because it gives the perfect answer. In short, don't use regular expressions to try to parse HTML - it's a really bad idea.

Matt Cruikshank
0

You need a pushdown automaton here. Regular expressions aren't powerful enough to capture this concept, since they are equivalent to finite-state automata, so a regex solution is strictly speaking a no-go.

That said, .NET regular expressions do have a pushdown automaton behind them so they can theoretically cope with such cases. If you really feel you need to do this with regular expressions rather than a formal HTML parser, take a glimpse here.
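The stack idea can also be sketched without balancing groups. Below is a rough Python illustration. The language, the function name orphan_end_tags, and the simplified tag tokenizer are all assumptions, not something taken from the answers here. A regex is used only to find the tags, and an explicit stack supplies the memory that a plain regular expression lacks. It is not a real HTML parser: it ignores comments and scripts, and does nothing special for void elements such as <BR>.

```python
import re

# Tokenizer only: recognises opening and closing tags, nothing more.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def orphan_end_tags(html, name=None):
    """Return (offset, text) for each closing tag that has no open counterpart.

    If `name` is given, only orphans of that tag name are reported.
    """
    stack = []    # names of currently open tags, innermost last
    orphans = []
    for m in TAG.finditer(html):
        if m.group(0).endswith("/>"):
            continue  # self-closing tag: never opens a scope
        is_close = m.group(1) == "/"
        tag = m.group(2).upper()
        if not is_close:
            stack.append(tag)
        elif tag in stack:
            # Pop through the most recent matching opener, discarding any
            # inner tags that were left unclosed.
            while stack.pop() != tag:
                pass
        elif name is None or tag == name.upper():
            orphans.append((m.start(), m.group(0)))
    return orphans

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"
print(orphan_end_tags(sample, "EM"))
```

Because the stack tracks every open tag, an <EM> that contains child tags is still closed correctly, which is the case the lookbehind pattern cannot handle.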

Konrad Rudolph
  • Interesting. Isn't that an advanced form of "forward reference" ? (http://www.regular-expressions.info/brackets.html) – VonC Dec 02 '08 at 08:27
  • Not really: forward references work similarly to back references, i.e. it's enough to store their content in an array. However, for balancing groups to work, the content of these groups has to be stored on a stack (which is the “pushdown” part). – Konrad Rudolph Dec 02 '08 at 20:40