This question sounds like a duplicate, but I've looked at a LOT of similar questions, and none fit the bill, either because they restrict their question to a very specific example or to a specific use case (e.g. single chars only), because you need substitution for a successful approach, or because you'd need to use a programming language (e.g. C#'s Split, or Match().Value).

I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.

For example, let's say I want to find the reverse of the Regex "over" in "The cow jumps over the moon": it would match "The cow jumps " and also match " the moon".

That's only a simple example of course. The Regex could be something messier such as "o.*?m", in which case the matches would be "The c", "ps ", and "oon".

Here is one possible solution I found after ages of hunting. Unfortunately, it requires the use of substitution in the replace field, which I was hoping to keep clear. Also, everything else is matched, but only on a character-by-character basis instead of in big chunks.

Just to stress again, the answer should be general-purpose for any arbitrary Regex, and not specific to any particular example.

Dan W
  • have a look at, say in python, `re.split`, i think it satisfies your requirement very much. – Jason Hu Jul 01 '15 at 16:19
  • Which lang you're running? – Avinash Raj Jul 01 '15 at 16:20
  • You should probably tag this as C# if that's the language you are using (same as the other solution you linked). – Joseph Marikle Jul 01 '15 at 16:21
  • Forgot to say, I don't want to use language features such as Split or Match. I want the Regex to do this entirely by itself. – Dan W Jul 01 '15 at 16:21
  • then you will end up implementing `split` yourself. think about it carefully. i doubt you are having xy problem now. – Jason Hu Jul 01 '15 at 16:22
  • @HuStmpHrrr: Even though I am coding this in C#, my users won't necessarily be using the program for the purpose of putting their generated Regex into their own program. They may just be using the final text as the final product, or using the Regex elsewhere where implementation of Split may differ or not exist at all. – Dan W Jul 01 '15 at 16:32
  • I'm not positive, but it's possible that putting the regex in a negative lookahead will work. – Justin Jul 01 '15 at 16:35
  • @DanW you have too many things to worry about then, do you know why people come and ask you what language or engine you are working on? because regex is nothing but a **group** of languages. i learnt over 5 different variation of regexes so how can you expect your regex would work in different engine? if you can't fix your engine, that's too much work for current context. – Jason Hu Jul 01 '15 at 16:35
  • @HuStmpHrr: Again, the Regex itself should preferably work across many flavours, and not be specific to any particular style. The [program](http://www.skytopia.com/software/wildgem/) I'm creating has a full GUI, and my users won't be using some kind of programming reflection within the GUI or anything to imitate Split(). I'm not really sure I'm understanding you. Also bear in mind, they may not use the resulting Regex anywhere else, just the generated text itself. – Dan W Jul 01 '15 at 16:41
  • @DanW why not just put a `match reversely` button there and reverse the highlight if it's clicked... – Jason Hu Jul 01 '15 at 16:46
  • @HuStmpHrrr: I did consider that, and it's a good idea, but there are a couple of drawbacks. One is they can't use the regex elsewhere if they need to, and two, I will need to improvise to make the Replace() code I already have in place work. My program doesn't just highlight the found matches according to a pattern; it also replaces what the user may enter in a separate text box. – Dan W Jul 01 '15 at 16:50

2 Answers

From post: I want to be able to get the reverse of any arbitrary Regex expression, so that everything is matched EXCEPT the found match.

The answer:
A match is not discontinuous, it is continuous!

Each match is a continuous, unbroken substring. So, within each match there
is no skipping anything within that substring. Whatever matched the
regular expression is included in a particular match result.

So within a single match, no inversion (i.e. "match everything but this") can extend past
the thing being excluded.

This is a tenet of regular expressions.

Further, in this case, since you only want all things NOT something, you have
to consume that something in the process.
This is easily done by just capturing what you want.

So, even with multiple matches, it's not good enough to say `(?:(?!\bover\b).)+`
because even though it will match up to (but not including) "over", on the next match
it will match "ver ...".

There are ways to avoid this that are tedious, requiring variable-length lookbehinds.
But the easiest way is to match up to "over", then "over", then the rest.
Several constructs can help. One is `\K`.
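To make the "ver ..." problem concrete, here is a minimal sketch in Python. Python is only an assumption for illustration (the question is engine-agnostic), and since the standard `re` module has no `\K`, the second pattern uses a plain capture group instead: it consumes "over" as part of each match and keeps only what came before it.

```python
import re

text = "The cow jumps over the moon"

# Lookahead alone: the first match stops before "over", but the next
# match resumes one character later and picks up "ver the moon".
naive = re.findall(r"(?:(?!\bover\b).)+", text)

# Consume "over" (or the end of the string) inside each match and capture
# only the text in front of it. Empty captures are dropped.
consuming = [
    m.group(1)
    for m in re.finditer(r"(.*?)(?:\bover\b|\Z)", text, re.DOTALL)
    if m.group(1)
]

print(naive)      # ['The cow jumps ', 'ver the moon']
print(consuming)  # ['The cow jumps ', ' the moon']
```

The price is that the wanted text sits in group 1 rather than in the whole match, which is exactly the "capturing what you want" trade-off described above.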

Unfortunately, there is no magical recipe to negate a pattern.

As you mentioned in your question, when you have an efficient pattern that you use with a match method, the easiest (and most efficient) way to obtain the complement is to use a split method with the same pattern.
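For reference, this is what the split route looks like with Python's `re.split` (Python is just an assumed example language here; other languages have their own split methods with slightly different behavior):

```python
import re

text = "The cow jumps over the moon"

# Splitting on the pattern returns exactly the pieces the pattern did not match.
simple = re.split(r"over", text)
messy = re.split(r"o.*?m", text)

print(simple)  # ['The cow jumps ', ' the moon']
print(messy)   # ['The c', 'ps ', 'oon']
```

Note that if the pattern contains capture groups, `re.split` also returns the captured text in the result list, so the pattern may need non-capturing groups first.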

To do it with the pattern itself, workarounds are:

1. consuming the characters that match the pattern

"other content" is the content until the next pattern or the end of the string.

alternation + capture group:

(pattern)|other content

Then you must check whether the capture group participated in the match to know which branch of the alternation succeeded.

"other content" can be described, for example, like this: .*?(?=pattern|$)

With PCRE and Perl, you can use backtracking control verbs to avoid the capture group, but the idea is the same:

pattern(*SKIP)(*FAIL)|other content

With this variant, you don't need to check anything afterwards, since the first branch is forced to fail.
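The alternation + capture group idea can be sketched in Python as follows. The helper name `complement_pieces` and the group name `hit` are made up for this illustration, and `\Z` is used as Python's strict end-of-string anchor in place of `$`. Python's standard `re` module does not support `(*SKIP)(*FAIL)`, so only the capture-group form is shown.

```python
import re

def complement_pieces(pattern, text):
    """Return the chunks of text that the pattern does NOT match.

    Hypothetical helper: builds (pattern)|other content, where
    "other content" is .*?(?=pattern|end of string).
    """
    combined = re.compile(
        rf"(?P<hit>{pattern})|.*?(?=(?:{pattern})|\Z)", re.DOTALL
    )
    pieces = []
    for m in combined.finditer(text):
        # "hit" is None when the second branch (other content) succeeded.
        # The empty match at the very end of the string is skipped.
        if m.group("hit") is None and m.group(0):
            pieces.append(m.group(0))
    return pieces

text = "The cow jumps over the moon"
print(complement_pieces(r"over", text))   # ['The cow jumps ', ' the moon']
print(complement_pieces(r"o.*?m", text))  # ['The c', 'ps ', 'oon']
```

This sketch pastes the pattern in twice, so it assumes the pattern has no numbered backreferences, no named groups, and cannot match the empty string.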

or without alternation:

((?:pattern)*)(other content)

variant in PCRE, Perl, or Ruby with the \K feature:

(?:pattern)*\Kother content

Where \K removes everything to its left from the match result.

2. checking characters of the string one by one

(?:(?!pattern).)*

This way is very simple to write (if lookahead is available), but it has the drawback of being slow, since the lookahead is tested at every position of the string.

The number of lookahead tests can be reduced if you can use the first character of the pattern (let's say "a"):

[^a]*(?:(?!pattern)a[^a]*)*
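A small Python check that the two forms agree, using "over" as the pattern and "o" as its first character. The helper names `tempered` and `unrolled` are invented for this sketch, which assumes the first character is a plain literal that needs no escaping inside a character class.

```python
import re

def tempered(pattern):
    # Test the lookahead before every single character.
    return rf"(?:(?!{pattern}).)*"

def unrolled(pattern, first_char):
    # Run through everything that is not the first character without any
    # lookahead, and only test the lookahead where that character occurs.
    return rf"[^{first_char}]*(?:(?!{pattern}){first_char}[^{first_char}]*)*"

text = "The cow jumps over the moon"

# DOTALL makes "." match newlines too, like the negated class [^o] does.
slow = re.findall(tempered("over"), text, re.DOTALL)
fast = re.findall(unrolled("over", "o"), text)

print(slow == fast)  # both forms produce the same list of matches
print(slow[0])       # 'The cow jumps '
```

Both forms still restart one character into the pattern on the following match, so they are best suited to grabbing the text up to the first occurrence.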

3. list all that is not the pattern.

using character classes

Let's say your pattern is /hello/:

([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$)))))*

This approach quickly becomes tedious as the number of characters grows, but it can be useful for regex flavors that lack many features, like POSIX regex.
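As a quick sanity check, here is the character-class pattern for /hello/ tried out in Python, written with its parentheses balanced (five closing parentheses before the final `*`). The helper name `leading_non_hello` is made up for this sketch. One caveat I'm aware of: the classes exclude `h` after a partial prefix, so inputs such as "hhello" are a case this simple version does not cover.

```python
import re

# Consumes text from the start of the string for as long as it cannot be
# the beginning of "hello". Uses only character classes, alternation,
# grouping and $, with no lookahead.
NOT_HELLO = re.compile(
    r"([^h]|h(([^eh]|$)|e(([^lh]|$)|l(([^lh]|$)|l([^oh]|$)))))*"
)

def leading_non_hello(text):
    # Hypothetical helper: return the chunk matched at the start of text.
    return NOT_HELLO.match(text).group(0)

print(repr(leading_non_hello("say hello world")))  # 'say '
print(repr(leading_non_hello("help me")))          # 'help me'
print(repr(leading_non_hello("hello")))            # ''
```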

Casimir et Hippolyte