Every Q&A I found about matching anything except a given word with a negative lookahead assumes line anchors (^ and $). But I can't figure out how to match anything (any characters, like .*) except a given word before some other word in the middle of the processed text.

I should match ABC inside <tag></tag>:

...<tag>a a.__aABC&*</tag>aaa<tag>ffff</tag>...

but not outside (false-positive):

...<tag>a a.__a&*</tag>ABC<tag>ffff</tag>...

So I think I should exclude the closing tag (</tag>) before ABC. I tried:

<tag>(?!<\/tag>)ABC.*?<\/tag>

but this doesn't allow arbitrary text (anything .* would match, except </tag>) before ABC. How can I implement this?
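
For reference, a minimal Python sketch of what the attempt actually does. Python's `re` is only a stand-in for PCRE here, and the names `ATTEMPT` and `wanted` are made up for illustration. The lookahead is tested once, directly after `<tag>`, so the pattern only finds ABC when it immediately follows the opening tag:

```python
import re

# The attempted pattern: the lookahead is checked at a single position,
# right after "<tag>", and the very next characters must then be "ABC".
ATTEMPT = re.compile(r"<tag>(?!</tag>)ABC.*?</tag>")

wanted = "...<tag>a a.__aABC&*</tag>aaa<tag>ffff</tag>..."

# Text between "<tag>" and "ABC" is not allowed, so the wanted case is missed.
print(ATTEMPT.search(wanted) is None)                  # True

# It only matches when ABC directly follows the opening tag.
print(ATTEMPT.search("<tag>ABC&*</tag>") is not None)  # True
```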


z0lupka
  • These are two things regexes are not good at: balanced grammars, and nested negation. As a side note, it is much better to use an XML parser in your language of choice. – Grinnz Jul 01 '19 at 15:01
  • @Grinnz Clearly. But the specific problem described looks simple. Can't it be solved with a regex at all? – z0lupka Jul 01 '19 at 15:13
  • The question is tagged `perl`, so can we assume that a Perl solution is desired? If so, this sort of tricky regex issue is why the universe agrees that one should parse XML with XML parsers. See https://metacpan.org/pod/XML::LibXML , for example. – DavidO Jul 01 '19 at 15:13
  • "*But after all specifically described problem looks simple.*" -- Correct. Looks can be deceiving, and after this problem is solved there will be another, because you're reinventing what turns out to be a fairly complicated wheel. – DavidO Jul 01 '19 at 15:14
  • @z0lupka It's almost always *possible* with a sufficiently featured regex engine. But the complexity required is almost never worthwhile. It's probably important if you specify what regex engine you are using (note in particular that PCRE is not Perl). – Grinnz Jul 01 '19 at 15:14
  • @DavidO The original goal was to do it purely with a PCRE regex, e.g. on regex101.com – z0lupka Jul 01 '19 at 15:21
  • https://stackoverflow.com/a/4234491/716443 – DavidO Jul 01 '19 at 15:35
  • @DavidO [This](https://stackoverflow.com/questions/37240408/regular-expressions-ensuring-b-doesnt-come-between-a-and-c) helped ;) – z0lupka Jul 01 '19 at 15:39

1 Answer

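Since you're using Perl or PCRE, a fast way to do it is with the backtracking control verbs (*SKIP)(*FAIL): as soon as a </tag> is reached before ABC, the whole attempt fails and the search resumes after that closing tag.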

/(?s)<tag>(?:<\/tag>(*SKIP)(*FAIL)|.)*?ABC.*?<\/tag>/

https://regex101.com/r/AoiwIH/1

Expanded

 (?s)
 <tag>  
 (?:
      </tag>
      (*SKIP) (*FAIL) 
   |  
      . 
 )*?
 ABC
 .*? 
 </tag>

Benchmark comparison with the assertion (negative lookahead) method

Regex1:   (?s)<tag>(?:</tag>(*SKIP)(*FAIL)|.)*?ABC.*?</tag>
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    0.25 s,   254.91 ms,   254905 µs
Matches per sec:   196,151


Regex2:   (?s)<tag>(?:(?!</tag>).)*?ABC.*?</tag>
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    0.33 s,   329.10 ms,   329095 µs
Matches per sec:   151,931
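
For engines without backtracking control verbs, the assertion method (Regex2) is the portable option. As a rough sketch, assuming Python's built-in `re` module (which does not support `(*SKIP)(*FAIL)`), it can be checked against the two sample lines from the question. The names `INSIDE_TAG` and `find_abc_tags` are made up for this example:

```python
import re

# Regex2 from the benchmark: each "." is guarded by a negative lookahead,
# so the scan before ABC can never step over a closing "</tag>".
# re.S plays the role of the inline (?s) flag.
INSIDE_TAG = re.compile(r"<tag>(?:(?!</tag>).)*?ABC.*?</tag>", re.S)


def find_abc_tags(text):
    """Return every <tag>...</tag> span whose body contains ABC."""
    return [m.group(0) for m in INSIDE_TAG.finditer(text)]


inside = "...<tag>a a.__aABC&*</tag>aaa<tag>ffff</tag>..."
outside = "...<tag>a a.__a&*</tag>ABC<tag>ffff</tag>..."

print(find_abc_tags(inside))   # ['<tag>a a.__aABC&*</tag>']
print(find_abc_tags(outside))  # []
```

The trailing `.*?` needs no guard because it is lazy and is followed directly by `</tag>`, so it stops at the first closing tag after ABC.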