37

I'm looking for a general regex construct to match everything in pattern x EXCEPT matches to pattern y. This is hard to explain both completely and concisely...see Material Nonimplication for a formal definition.

For example, match any word character (\w) EXCEPT 'p'. Note I'm subtracting a small set (the letter 'p') from a larger set (all word characters). I can't just say [^p] because that doesn't take into account the larger limiting set of only word characters. For this little example, sure, I could manually reconstruct something like [a-oq-zA-OQ-Z0-9_], which is a pain but doable. But I'm looking for a more general construct, so that at least the large positive set can be a more complex expression. For instance, match ((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern) except when it starts with "My".

Edit: I realize that was a bad example, since excluding stuff at the beginning or end is a situation where negative look-ahead and look-behind expressions work. (Bohemian, I still gave you an upvote for illustrating this.) So what about excluding matches that contain "My" somewhere in the middle? I'm still really looking for a general construct, like a regex equivalent of the following pseudo-SQL:

select [captures] from [input]
where (
    input MATCHES [pattern1]
    AND NOT capture MATCHES [pattern2]
)

If the answer is "it does not exist and here is why..." I'd like to know that too.

Edit 2: If I wanted to define my own function to do this it would be something like (here's a C# LINQ version):

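using System.Linq;
using System.Text.RegularExpressions;

// Returns the matches of positivePattern whose matched text
// does not itself contain a match of negativePattern.
public static Match[] RegexMNI(string input,
                               string positivePattern,
                               string negativePattern) {
    return (from Match m in Regex.Matches(input, positivePattern)
            where !Regex.IsMatch(m.Value, negativePattern)
            select m).ToArray();
}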
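
For comparison, here is a rough Python sketch of the same two-pass filter. The function name `regex_mni` and the sample input are made up for illustration; the only assumption is that "does not match the negative pattern" means the negative pattern is not found anywhere inside the captured text, which is how `Regex.IsMatch` behaves in the C# version.

```python
import re


def regex_mni(text, positive_pattern, negative_pattern):
    """Return matches of positive_pattern whose matched text
    contains no match of negative_pattern."""
    negative = re.compile(negative_pattern)
    return [
        m
        for m in re.finditer(positive_pattern, text)
        # search() looks anywhere inside the captured text
        if not negative.search(m.group(0))
    ]


# Hypothetical sample input: two candidates, one of which contains "My".
kept = [m.group(0) for m in regex_mni("bigMycomplex bigcomplex", r"big\w*complex", r"My")]
print(kept)
```

Only the second candidate survives the filter, because the negative pattern is tested against each capture rather than against the whole input.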

I'm STILL just wondering if there is a native regex construct that could do this.

Joshua Honig
  • 12,925
  • 8
  • 53
  • 75
  • Perhaps you could accept an answer... – Bohemian Apr 09 '13 at 22:14
  • 2
    @Bohemian No one actually answered the question. They all got stuck on the specifics of my example, rather than answer the question in the abstract but complete. Both edits provide the set-logic concept clearly. – Joshua Honig Apr 10 '13 at 00:14
  • 4
    To answer your edited question, the general solution to "contains A and not contains B" is `^(?!.*B).*A` – Bohemian Apr 10 '13 at 01:57
  • What is `?<=`? I've never seen that expression before, though I've mainly done JavaScript, whose regular expression language is not very expressive. – trysis Mar 16 '17 at 19:40
  • @Bohemian I think it is `^(?!.*B).*A.*` to select the whole line – Yuri Aps Jun 16 '21 at 12:32

4 Answers

28

This will match any word character that is not a p:

((?=[^p])\w)
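
A quick check of this in Python, as a sketch (the sample string is made up). It also tries two equivalent spellings that are not from the answer itself: a negative look-ahead, `(?!p)\w`, and the double-negated class `[^\Wp]`, which reads as "not a non-word character and not p".

```python
import re

text = "apple_pie 42"  # hypothetical sample input

# The answer's form: assert the next character is not "p", then consume a word character.
lookahead = re.findall(r"(?=[^p])\w", text)

# Alternative spellings of the same set subtraction.
negative = re.findall(r"(?!p)\w", text)
subtraction = re.findall(r"[^\Wp]", text)

print(lookahead)
```

All three return the same list here: every word character except the lowercase p. An uppercase P would still be matched, since only the one letter is subtracted.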

To solve your example, use a negative look-ahead that rejects "My" anywhere in the input, i.e. (?!.*My):

^(?!.*My)((?<=(so|me|^))big(com?pl{1,3}ex([pA]t{2}ern)

Note the anchor to start of input ^ which is required to make it work.
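
A minimal sketch of how the anchored look-ahead behaves, in Python. It uses the general form `^(?!.*B).*A` from the comments, with the literal text "big complex" as a hypothetical stand-in for the longer pattern above; the function name is also made up.

```python
import re

# "big complex" stands in for pattern A; "My" is the excluded pattern B.
pattern = re.compile(r"^(?!.*My).*big complex")


def contains_a_not_b(text):
    """True when text contains A and "My" appears nowhere in it."""
    return pattern.search(text) is not None


print(contains_a_not_b("a big complex pattern"))
print(contains_a_not_b("My big complex pattern"))
```

Because the look-ahead is evaluated once at the start of the input, it rejects the input wherever "My" occurs, whether before, inside, or after the part that matches A. Note that in Python `.` does not cross newlines by default, so this check is per line unless `re.DOTALL` is used.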

Bohemian
  • 412,405
  • 93
  • 575
  • 722
  • Edited to change the negative look ahead to assert "My" doesn't appear *anywhere* in the input (previously it only checked for My at the start). – Bohemian Apr 10 '13 at 01:52
  • The OP makes it clear that he's after "My" not being in the *matching* expression he found. Your negative lookahead searches the *entire* string input rather than the subset. Really he's wanting to pipe one regex through another regex which doesn't seem possible without a script as far as I know. Any thoughts on how to solve this without a script or without making the lookahead as complex as the main regex pattern? – horta May 13 '15 at 21:33
16

I wonder why people try to do complicated things in big monolithic regular expressions?

Why can't you just break down the problem into sub-parts and then make really easy regular expressions to match those individually? In this case, first match \w, then match [^p] if that first match succeeds. Perl (and other languages) allows for constructing really complicated-looking regular expressions that allow you to do exactly what you need to do in one big blobby-regex (or, as it may well be, with a short and snappy crypto-regex), but for the sake of whoever needs to read (and maintain!) the code once you've gone, you need to document it fully. Better, then, to make it easy to understand from the start.

Sorry, rant over.

Kusalananda
  • 14,885
  • 3
  • 41
  • 52
  • 3
    Ah yes, the Zawinski effect, whereby using REs expands the number of problems. (My favorite was when someone asked for an RE to accept valid IEEE doubles that had been written into an XML document…) – Donal Fellows Sep 25 '11 at 21:55
  • The reason I want to match things in one go is that I want to capture and operate on numerous captures in an input string, such as finding and reformatting declarations that match a certain pattern in a few hundred lines of code. I could toss out regex altogether and go parsing character-by-character...but if there's a good power tool, might as well use it! – Joshua Honig Sep 25 '11 at 22:18
  • Programming languages are a different class of grammars (context free) than what regular expressions recognize (regular), so be careful... – escape-llc Jan 05 '18 at 16:02
7

After your edits, it's still the negative lookahead, but with an additional quantifier.

If you want to ensure that the whole string does not contain "My", then you can do this

(?!.*My)^.*$

See it here on Regexr

This will match any sequence of characters (with the .* at the end), and the (?!.*My) at the beginning will fail when there is a "My" anywhere in the string.

If you want to match anything that is not exactly "My", then use anchors

(?!^My$).*
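
Both patterns can be tried out in Python along these lines. This is only a sketch: the function names and sample strings are made up, and the second check uses `fullmatch` on the assumption that the whole string is being tested, since an unanchored search could still match a leftover substring such as the "y" in "My".

```python
import re

no_my_anywhere = re.compile(r"(?!.*My)^.*$")
not_exactly_my = re.compile(r"(?!^My$).*")


def lacks_my(text):
    """True when "My" appears nowhere in the (single-line) string."""
    return no_my_anywhere.search(text) is not None


def is_not_exactly_my(text):
    """True for any (single-line) string other than exactly "My"."""
    return not_exactly_my.fullmatch(text) is not None


print(lacks_my("nothing to see"), lacks_my("this is My string"))
print(is_not_exactly_my("Myself"), is_not_exactly_my("My"))
```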
stema
  • 90,351
  • 20
  • 107
  • 135
1

So after looking through these topics on RegEx's: lookahead, lookbehind, nesting, AND operator, recursion, subroutines, conditionals, anchors, and groups, I've come to the conclusion that there is no solution that satisfies what you're asking for.

The reason why lookahead doesn't work is because it fails in this relatively simple case:

Three words without My included as one.

Regex:

(?!.*My.*)(\b\w+\b\s\b\w+\b\s\b\w+\b)

Matches:

included as one

The first three words fail to match because "My" occurs after them. If "My" is at the end of the entire string, you'll never match anything, because every lookahead will see it and fail.
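
This behaviour can be reproduced with a short Python sketch that runs the three-word pattern both with and without a leading `^` anchor (the variable names are made up for illustration):

```python
import re

text = "Three words without My included as one."
three_words = r"(\b\w+\b\s\b\w+\b\s\b\w+\b)"

# Anchored: the lookahead runs once at position 0, sees "My" further on, and fails.
anchored = re.search(r"^(?!.*My.*)" + three_words, text)

# Unanchored: the engine retries at later positions until no "My" lies ahead.
unanchored = re.search(r"(?!.*My.*)" + three_words, text)

print(anchored)
print(unanchored.group(1))
```

With the anchor there is no match at all. Without it, the only match is "included as one", and "Three words without" is skipped even though it does not contain "My" itself.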

The problem appears to be that while lookahead has an implicit anchor as to where it begins its match, there's no way of terminating where lookahead ends its search with an anchor based upon the result of another part of the RegEx. That means you really have to duplicate all of the RegEx into the negative lookahead to manually create the anchor you're after.

This is frustrating and a pain. The "solution" appears to be to use a scripting language to run two regexes, one on top of the other. I'm surprised this kind of functionality isn't better built into regular expression engines.
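
For what it's worth, one way to limit the duplication in simple cases is to push the negative lookahead into every consuming token, so that the exclusion is checked only at positions the match actually covers. This is not from the thread above. It is a hedged sketch in Python, assuming the excluded text is a fixed string like "My" and the positive pattern can be rewritten token by token; the helper name `word` is made up.

```python
import re

text = "Three words without My included as one."

# One "word" whose characters never begin the excluded text "My".
word = r"(?:(?!My)\w)+"

# Three such words separated by single whitespace characters.
three_clean_words = re.compile(rf"\b{word}\s{word}\s{word}\b")

found = [m.group(0) for m in three_clean_words.finditer(text)]
print(found)
```

Here both "Three words without" and "included as one" are found, because the "My" outside each match no longer matters. The catch is the one described above: the exclusion has to be woven into the positive pattern by hand, so it does not generalise to an arbitrary pair of patterns.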

horta
  • 1,110
  • 8
  • 17