Regex - Match an end html tag if start tag is not present

Question

I want to match an ending HTML tag like </EM> only if there is no starting <EM> tag somewhere before it, i.e. before any previous tags or text. My sample string is:

ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>

In this string the output should be </EM>, and only the second </EM>, because it lacks the starting <EM>. I have tried:

(?!=<EM>.*)</EM>

but it doesn't seem to work. Please help, thanks.

The regex I use brings both `</EM>`s, whereas I only want the second to be matched – shabby Dec 02 '08 at 07:33
You can always edit your own post to include additions and clarifications. They are more likely to be read this way. – Tomalak Dec 02 '08 at 08:07

VonC · Accepted Answer · 2008-12-02T07:50:05.547

3

I am not sure regex is best suited for this kind of task, since tags can always be nested.

Anyhow, a C# regex like:

(?<!<EM>[^<]+)</EM>

would only bring the second </EM> tag.

Note that:

?! is a negative lookahead, which explains why both </EM> are found.
So (?!=.*)xxx actually means: match xxx only where the text at that position does not start with =.*. Unless xxx itself begins with =, that assertion always succeeds. I am not sure you wanted to include an = in there.
?<! is a negative lookbehind, more suited to what you wanted to do, but it would not work with the Java regex engine, since this lookbehind does not have an obvious maximum length.

However, with a .NET regex engine, as tested on RETester, it does work.
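The two notes above can be checked outside .NET as well. The following is only a rough sketch using Python's standard `re` module (Python is not mentioned in the thread; it is used here purely for illustration). It shows the lookahead from the question matching both closing tags, and the variable-length lookbehind being rejected by an engine that, like Java's, needs a bounded lookbehind.

```python
import re

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"

# The pattern from the question: the lookahead is tested at the position where
# "</EM>" begins. The text there starts with "<", never with "=", so the
# negative assertion always succeeds and filters nothing.
lookahead_hits = [m.start() for m in re.finditer(r"(?!=<EM>.*)</EM>", sample)]
print(len(lookahead_hits))  # number of closing tags matched

# The lookbehind from the answer has no fixed width because of "[^<]+".
# Python's re refuses to compile it instead of running it.
try:
    re.compile(r"(?<!<EM>[^<]+)</EM>")
    lookbehind_accepted = True
except re.error as exc:
    lookbehind_accepted = False
    print("rejected:", exc)
```

So the lookbehind pattern is specific to engines that allow unbounded lookbehind, such as .NET.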

edited Dec 02 '08 at 07:50

answered Dec 02 '08 at 07:42

VonC

I tried this; it isn't working. It brings exactly the same matches as mine. Thanks anyway – shabby Dec 02 '08 at 08:18
It fails if an EM tag has a child tag like this: ddd ddfdsdsd dsdsdddssrdfddfs n dand strong – shabby Nov 21 '13 at 19:13
@shabby I agree. At the time (almost 5 years ago) I was merely testing the regexp based on your example in your question. It won't match every case, but since October 2011, we all know that ;) http://stackoverflow.com/a/1732454/6309 – VonC Nov 21 '13 at 19:19

score 0 · Answer 2 · edited May 23 '17 at 10:25

You should see the top answer to this other Stack Overflow question, because it gives the perfect answer. In short, don't use regular expressions to try to parse HTML - it's a really bad idea.

edited May 23 '17 at 10:25

Community

answered Jan 10 '14 at 22:07

Matt Cruikshank

score 0 · Answer 3 · answered Dec 02 '08 at 08:09

You need a pushdown automaton here. Regular expressions aren't powerful enough to capture this concept, since they are equivalent to finite-state automata, so a regex solution is strictly speaking a no-go.

That said, .NET regular expressions do have a pushdown automaton behind them so they can theoretically cope with such cases. If you really feel you need to do this with regular expressions rather than a formal HTML parser, take a glimpse here.
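To make the pushdown idea concrete, here is a minimal sketch in Python (not from the thread; the function name `unmatched_closing_tags` and the simplified tag pattern are assumptions made for illustration). A regex is used only to find tags, and an explicit stack tracks which tags are currently open. It is not an HTML parser: comments, attribute values containing ">", and void elements such as <br> are not handled.

```python
import re

# Simplified tag tokenizer (an assumption): optional "/", a tag name, then
# anything up to the next ">".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def unmatched_closing_tags(html):
    """Return (offset, tag_name) for each closing tag with no pending open tag."""
    stack = []    # names of currently open tags, innermost last
    orphans = []
    for m in TAG.finditer(html):
        is_close = m.group(1) == "/"
        name = m.group(2).upper()
        if not is_close:
            stack.append(name)
        elif name in stack:
            # Close the innermost pending tag of that name, along with
            # anything still open inside it.
            while stack.pop() != name:
                pass
        else:
            orphans.append((m.start(), name))
    return orphans

sample = "ddd d<STRONG>dfdsdsd dsdsddd<EM>ss</EM>r and</EM>and strong</STRONG>"
print(unmatched_closing_tags(sample))
```

Because the open tags live on a stack rather than inside a lookbehind, an EM element with child tags is treated as opened and closed normally, which is the case the lookbehind pattern misses.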

answered Dec 02 '08 at 08:09

Konrad Rudolph

Interesting. Isn't that an advanced form of "forward reference" ? (http://www.regular-expressions.info/brackets.html) – VonC Dec 02 '08 at 08:27
Not really: forward references work similarly to back references, i.e. it's enough to store their content in an array. However, for balancing groups to work, the content of these groups has to be stored on a stack (which is the “pushdown” part). – Konrad Rudolph Dec 02 '08 at 20:40
