Regular Expression find a phrase not inside an HTML tag

Question

I'm struggling a bit with this regular expression and wondered if anyone would be able to help me, please?

What I need to do is isolate the 1st phrase inside a string which is NOT inside an HTML tag. So the examples I have at the moment are:

This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,

... and ...

This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess

So in the first example I want it to ignore the wrapped ITS and give me the ITS at the end of the 1st sentence.

In the second example I want it to return the ITS at the start of the 2nd sentence.

The aim is to replace these with my own custom wrapped acronym tags in a ColdFusion application I'm writing.

Thanks a lot, James

[YOU CANNOT PARSE HTML USING Regular Expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)! — SLaks, May 05 '10 at 14:27
@James: you might not *want* to parse HTML, but you'll have to to achieve these results. — FrustratedWithFormsDesigner, May 05 '10 at 14:33
To say you CANNOT is a bit vague and, to be honest, an "answer machine" reply. I tried to Google this before I posted and it seemed like most SO results I came across had someone slapping that answer in there. It's not easy to do, I agree, and the results can be unpredictable, but really it's not impossible and there are working examples of it. We work with HTML every day so you can't avoid it ;-) — James Buckingham, May 05 '10 at 14:38
Ok well what regex have you tried so far, and what result does it give? — FrustratedWithFormsDesigner, May 05 '10 at 14:44
Thanks F. :-) I'm reworking things at the moment because of another change I've made. Before, I was stripping out all acronym tags and rebuilding the cleaned-up string, but now I'm using ]*?GOTManager[^>]*>(.*?) to take out just the GOTManager ones. Originally I had a simple \bITS\b but now I need this new one. It's a work in progress, but once I get something semi-worth posting I'll stick it in ;-) — James Buckingham, May 05 '10 at 15:08
It's impossible to handle nested tags, which you aren't doing. — SLaks, May 05 '10 at 15:13
HTML CAN be parsed with regex, but only if you know the maximum depth of nested tags. Writing a regex for depth > 1 is nasty, and writing a regex to parse any HTML is impossible. If you have a regularly formatted XML/HTML text AND if it's easy to use regex in that case, I can't see a reason to not use it. I wouldn't import a whole parser library just to extract some text in `li` tags. But if I spend 5 minutes and still can't write a working regex, I would stop right there and use the right tool. — tiftik, May 05 '10 at 15:16
Thanks Tiftik and Jens. This isn't going to be rocket science. The users aren't inserting lists or tables etc. It's a simple piece of text with a few basic style tags. I'll give that a shot though Jens, it's appreciated :-) — James Buckingham, May 05 '10 at 15:19
The comments here provide a strong case for the ability to downvote comments. — Hooray Im Helping, May 05 '10 at 15:28
@James, is there *any* reason for your obsession with regular expressions? Use an HTML parser, then your problem becomes trivial. Don’t use one, get crappy help on Stack Overflow. If nothing else, it would be faster – you would probably already be done. — Konrad Rudolph, May 05 '10 at 15:29
@Konrad If I was doing anything complex then yeah a parser would be a better solution but the example I've given is as complicated as the HTML is going to get :-) — James Buckingham, May 05 '10 at 15:34
@James Buckingham: … and **still** nobody has dared to post a correct, robust regular expression to solve your problem. Just for kicks, your expression posted in a comment above is wrong (e.g. it chokes on `GOTManager` which is valid HTML). Can’t you just accept that even your simple HTML is *hellishly* complex to parse with regular expressions? — Konrad Rudolph, May 05 '10 at 15:51
@James: Just an idea: Would an XSL transformation be more appropriate to what you're trying to do? — FrustratedWithFormsDesigner, May 05 '10 at 16:11
Thanks very much everyone for your comments & feedback. I'm willing to admit defeat on the RegEx approach (hooray you say!) & I'll have a look into this HTML Parser / XML approach today instead. I did a bit of digging around last night and found a Java based one called TagSoup. So I'll have a go of that along with your suggestions Frustrated (I've no XSL experience so this'll be fun!) and see how I get on. James — James Buckingham, May 06 '10 at 08:44

Jens · Answer 1 · 2010-05-05T15:20:55.093

As the commenters have pointed out, regular expressions are not a good tool for working with XML/HTML-like texts. That is because being "inside" something is very hard to check for in any generality (you never know which of the possibly unlimited nesting levels you are in).

For your particular examples, though, it is possible to do. This relies heavily on not having any nested tags. If you do have them, you should seriously try a different approach.

Your examples work with

^(?:<[^<]*<[^>]*>|.)*?(ITS)

This matches the entire string up to the first occurrence of ITS not in a tag (and has this in its first capturing group), but it should be easy to extract the data you need from there. Matching only this instance of ITS is not possible, since your implementation of regular expressions does not support arbitrary-length look-behinds.
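To make the "extract the data you need" step concrete, here is a minimal sketch in Python (picked only for the demo; the question targets ColdFusion, whose regex flavour may behave differently). The sample string, the helper name `wrap_first_unwrapped` and the replacement markup with `class="MyAcronym"` are made up for illustration. `re.DOTALL` is an addition so that `.` also crosses line breaks.

```python
import re

# The pattern from above: lazily skip either a whole <tag ...>text</tag>
# element or a single character, until the first ITS outside an element.
# re.DOTALL is an assumption added here so "." also matches newlines.
PATTERN = re.compile(r'^(?:<[^<]*<[^>]*>|.)*?(ITS)', re.DOTALL)

def wrap_first_unwrapped(text, replacement):
    """Replace the first ITS outside an element (hypothetical helper)."""
    m = PATTERN.match(text)
    if m is None:
        return text
    start, end = m.span(1)  # position of capturing group 1 only
    return text[:start] + replacement + text[end:]

sample = ('This is some test text about <acronym title="Incomplete Test Syndrome" '
          'class="CustomClass">ITS</acronym> for the ITS department.')
wrapped = wrap_first_unwrapped(sample, '<acronym class="MyAcronym">ITS</acronym>')
print(wrapped)
```

One caveat worth knowing: if the string contains no unwrapped ITS at all, backtracking lets `.` consume the `<`, and the pattern then matches the ITS inside the element anyway.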

Ask if you want/need the expression explained. =)

score 0 · Answer 2 · edited May 23 '17 at 12:30

I will tell you the same thing I told you when you asked a very similar question: Stuck with Regular Expression code to apply HTML tag to text but exclude if inside <?> tag

You CANNOT parse HTML, including nested elements, with pure regular expressions. This is a known limitation of regex and is well documented.

You can try installing and using an external regular expression engine with extensions, which might work. You can manually walk the string, counting the nesting to see if the string you are looking at is wrapped. You can use a genuine HTML parser, like WebKit, to do this externally.

But you can't do it with regex. Please look for an alternative. Heck, we'll even help.
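As a rough sketch of the "count the nesting" and "use a parser" options combined, the following uses Python's standard-library `html.parser` as a stand-in (not WebKit, and not ColdFusion). The class name `FirstUnwrapped`, the helper `wrap_first`, the `VOID` set and the `class="MyAcronym"` replacement are all assumptions made up for this example. It treats text inside any element as "wrapped".

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag

class FirstUnwrapped(HTMLParser):
    """Rebuild the markup, replacing the first `phrase` found at depth 0.

    Comments and doctypes are not copied through in this sketch.
    """

    def __init__(self, phrase, replacement):
        super().__init__(convert_charrefs=False)  # leave &amp; etc. untouched
        self.phrase, self.replacement = phrase, replacement
        self.depth = 0     # number of elements we are currently inside
        self.done = False  # True once the first replacement has been made
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # raw tag text, as written
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # <br/> style: depth unchanged

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag not in VOID:
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        if self.depth == 0 and not self.done and self.phrase in data:
            data = data.replace(self.phrase, self.replacement, 1)
            self.done = True
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def wrap_first(html, phrase, replacement):
    parser = FirstUnwrapped(phrase, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

sample = ('This is some test text about <acronym title="Incomplete Test Syndrome" '
          'class="CustomClass">ITS</acronym> for the ITS department.')
result = wrap_first(sample, "ITS", '<acronym class="MyAcronym">ITS</acronym>')
print(result)
```

Unlike a pure regex, this leaves the string unchanged when every ITS is already wrapped, because the depth counter never drops to zero around the wrapped text.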

FrustratedWithFormsDesigner · Answer 3 · 2010-05-05T18:23:14.410

You say:

The aim is to replace these with my own custom wrapped acronym tags in a ColdFusion application I'm writing.

It sounds like using XSL might be more appropriate than regex to transform one tag into another.

UPDATE:

Just threw this together, it seems to work for simple cases:

(NOTE: this will simply strip out the 'acronym' tags. You could use XSL to replace them with your own custom tags, but you didn't specify anything along those lines so I didn't get into that)

XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:template match="*[name() = 'acronym']" />
</xsl:stylesheet>

Input:

<?xml version="1.0" encoding="UTF-8"?>
<root>
This is some test text about <acronym
title="Incomplete Test Syndrome"
class="CustomClass">ITS</acronym> for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,

This is some **ITS** test text about
<acronym title="Incomplete Test
Syndrome"
class="GOTManager">ITS</acronym> for
the ITS department. Also worth
mentioning ABS as well I guess
</root>

Output:

<?xml version="1.0" encoding="UTF-8"?>
This is some test text about  for
the **ITS** department. Also worth
mentioning ABS as well I guess.ITS,

This is some **ITS** test text about
 for
the ITS department. Also worth
mentioning ABS as well I guess

UPDATE:

You said:

So in the first example I want it to ignore the wrapped ITS and give me the ITS at the end of the 1st sentence.

In the second example I want it to return the ITS at the start of the 2nd sentence.

This makes no sense. Your second example doesn't have "ITS" in the second sentence. I think what you meant was that the **ITS** is what you want to have extracted.

The XSL sample I gave only strips the <acronym/> tags, but after that's done you can try to find the ITS at different points in the sentence, and for that a regex might be easy enough (this assumes that you ONLY have to worry about the <acronym/> tags).
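For illustration only, the same strip-then-search idea can be sketched in Python with the standard library's `xml.etree.ElementTree` instead of an XSL processor. The helper name `text_without` and the shortened input string are assumptions for this example, and the input must be well-formed XML, as in the `<root>` sample above.

```python
import re
import xml.etree.ElementTree as ET

def text_without(element, skip_tag):
    """Concatenate the text of `element`, leaving out any <skip_tag> subtree."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append(text_without(child, skip_tag))
        parts.append(child.tail or "")  # text after the child is kept either way
    return "".join(parts)

xml_input = ('<root>This is some test text about <acronym '
             'title="Incomplete Test Syndrome" class="CustomClass">ITS</acronym> '
             'for the ITS department. Also worth mentioning ABS as well I guess.</root>')

# Step 1: drop the acronym elements, as the stylesheet does.
stripped = text_without(ET.fromstring(xml_input), "acronym")

# Step 2: a plain word-boundary regex is now enough to find the phrase.
match = re.search(r"\bITS\b", stripped)
print(stripped)
print(match.start() if match else None)
```

Note that the offset found this way refers to the stripped text, not the original markup, so it would still have to be mapped back if the goal is to wrap the phrase in place.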
