18

I am trying to write a replacement regular expression to surround all words in quotes except the words AND, OR and NOT.

I have tried the following for the match part of the expression:

(?i)(?<word>[a-z0-9]+)(?<!and|not|or)

and

(?i)(?<word>[a-z0-9]+)(?!and|not|or)

but neither works. The replacement expression is simple and currently surrounds all words.

"${word}"

So

This and This not That

becomes

"This" and "This" not "That"

John

6 Answers

14

This is a little dirty, but it works:

(?<!\b(?:and| or|not))\b(?!(?:and|or|not)\b)

In plain English, this matches any word boundary not preceded by and not followed by "and", "or", or "not". It matches whole words only, e.g. the position after the word "sand" would not be a match just because it is preceded by "and".

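The space in front of the "or" in the zero-width look-behind assertion pads all three alternatives to the same length, which is necessary in regex engines that only support fixed-length look-behind (.NET itself accepts variable-length look-behind). See whether that already solves your problem.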

EDIT: Applied to the string "except the words AND, OR and NOT." as a global replace with single quotes, this returns:

'except' 'the' 'words' AND, OR and NOT.
Tomalak
  • The only situation where this can fail is when the string starts with the word "or". Okay, and it contains the hidden assumption that spaces separate your words. Both situations can be mitigated if you know your data. – Tomalak Oct 28 '08 at 10:12
  • As with all regex, it is crazy but it works. (?<word>[a-z0-9]+)(?<!and| or|not)\b(?!and|or|not) Thanks – John Oct 28 '08 at 10:12
  • What do you need "(?<word>[a-z0-9]+)" for? Are you trying to surround your words with quotes or are you trying to pluck them out of the string? – Tomalak Oct 28 '08 at 10:18
  • That fails for words ending or beginning with any of the given words. "helloand not goodbye" -> "'helloand not 'goodbye'" – Markus Jarderot Oct 28 '08 at 11:58
  • True. Thanks for the tip, I expanded the regex to account for that. – Tomalak Oct 28 '08 at 12:14
5

John,

The regex in your question is almost correct. The only problem is that you put the lookahead at the end of the regex instead of at the start. Also, you need to add word boundaries to force the regex to match whole words. Otherwise, it will match "nd" in "and", "r" in "or", etc, because "nd" and "r" are not in your negative lookahead.

(?i)\b(?!and|not|or)(?<word>[a-z0-9]+)\b
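
As a minimal sketch of the lookahead-first approach in C#, the snippet below uses the named group `word` and the `"${word}"` replacement from the question. It is not the exact pattern above: it adds a `\b` inside the lookahead (an adjustment borrowed from other answers on this page) so that only the whole words are excluded and words such as "order" or "nothing" still get quoted. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper: quotes every word except AND, OR and NOT.
    public static string QuoteWords(string input)
    {
        // The \b inside the lookahead restricts the exclusion to whole words,
        // so "order", "nothing" and "android" are still quoted.
        const string pattern = @"(?i)\b(?!(?:and|not|or)\b)(?<word>[a-z0-9]+)\b";
        return Regex.Replace(input, pattern, "\"${word}\"");
    }

    static void Main()
    {
        Console.WriteLine(QuoteWords("This and This not That"));
        // prints: "This" and "This" not "That"
        Console.WriteLine(QuoteWords("order nothing AND android"));
        // prints: "order" "nothing" AND "android"
    }
}
```

Without the inner `\b`, the lookahead rejects any word that merely starts with "and", "not" or "or", so the second input would come back unchanged.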

Jan Goyvaerts
  • Yes, everyone else is making this a lot more complicated than it needs to be. In particular, there's no need for negative (or positive, for that matter) lookbehinds or named captures. – Alan Moore Nov 02 '08 at 12:15
  • Two things: first, I’ve come to the conclusion that specifiying a literal `[a-z]` in a regex instead of `\pL` or `\p{Alphabetic}` or sometimes `[[:alpha:]]` is almost always too “1960s” in our post–7‐bit age. Second, I find people [often misunderstand what \b really does](http://stackoverflow.com/questions/4213800/is-there-something-like-a-counter-variable-in-regular-expression-replace/4214173#4214173), so lately I’ve been adding provisos on its gotchas whenever I recommend it. (Yes, I know that *you* of course understand all this, Jan, but many readers probably do not.) – tchrist Nov 18 '10 at 16:27
4

Call me crazy, but I'm not a fan of fighting regex; I limit my patterns to simple things I can understand, and often cheat for the rest - for example via a MatchEvaluator:

    string[] whitelist = new string[] { "and", "not", "or" };
    string input = "foo and bar or blop";
    string result = Regex.Replace(input, @"([a-z0-9]+)",
        delegate(Match match) {
            string word = match.Groups[1].Value;
            return Array.IndexOf(whitelist, word) >= 0
                ? word : ("\"" + word + "\"");
        });

(edited for more terse layout)
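
For reference, here is a self-contained sketch of the same MatchEvaluator idea that also handles mixed case, which the question's example ("This and This not That") needs. The class name, the `Keep` set and the `Quote` method are assumptions made for illustration, and a lambda stands in for the anonymous delegate.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EvaluatorDemo
{
    // Words to leave untouched; the comparer makes the lookup case-insensitive.
    static readonly HashSet<string> Keep =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "not", "or" };

    // Hypothetical helper: the regex only finds words, the lambda decides what to do.
    public static string Quote(string input)
    {
        return Regex.Replace(input, @"[A-Za-z0-9]+",
            m => Keep.Contains(m.Value) ? m.Value : "\"" + m.Value + "\"");
    }

    static void Main()
    {
        Console.WriteLine(Quote("This and This not That"));
        // prints: "This" and "This" not "That"
    }
}
```

The pattern stays trivial, and the exception list lives in ordinary code where it is easy to change.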

Marc Gravell
2

Based on Tomalak's answer:

(?<!and|or|not)\b(?!and|or|not)

This regex has two problems:

  1. In many regex engines (though not .NET), (?<! ) only works with a fixed-length look-behind

  2. The previous regex only looked at the ending/beginning of the surrounding words, not the whole word.

(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)

This regex fixes both of the above problems: first by splitting the look-behind into three separate ones, and second by adding word boundaries (\b) inside the look-arounds.
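
Because this pattern matches positions rather than characters, the replacement is just the quote character itself. A minimal C# sketch, assuming case-insensitive matching is wanted and using hypothetical class and method names:

```csharp
using System;
using System.Text.RegularExpressions;

static class ZeroWidthDemo
{
    // Hypothetical helper: inserts a quote at every word boundary that does not
    // belong to one of the whole words "and", "or" or "not".
    public static string QuoteWords(string input)
    {
        const string pattern = @"(?<!\band)(?<!\bor)(?<!\bnot)\b(?!(?:and|or|not)\b)";
        return Regex.Replace(input, pattern, "\"", RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Console.WriteLine(QuoteWords("This and This not That"));
        // prints: "This" and "This" not "That"
        Console.WriteLine(QuoteWords("helloand not goodbye"));
        // prints: "helloand" not "goodbye"
    }
}
```

The second call is the case raised in the comments on Tomalak's answer: "helloand" ends in "and" but is still quoted, because the `\b` inside the look-behind requires "and" to be a whole word.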

Markus Jarderot
2

To match any "word" that is a combination of letters, digits or underscores (including any other word chars defined in the \w shorthand character class), you may use word boundaries like in

\b(?!(?:word1|word2|word3)\b)\w+

If the "word" is a chunk of non-whitespace characters with start/end of string or whitespace on both ends use whitespace boundaries like in

(?<!\S)(?!(?:word1|word2|word3)(?!\S))\S+

Here, the two expressions will look like

\b(?!(?:and|not|or)\b)\w+
(?<!\S)(?!(?:and|not|or)(?!\S))\S+

See the regex demo (or, a popular regex101 demo, but please note that PCRE \w meaning is different from the .NET \w meaning.)

Pattern explanation

  • \b - word boundary
  • (?<!\S) - a negative lookbehind that matches a location that is not immediately preceded with a character other than whitespace, it requires a start of string position or a whitespace char to be right before the current location
  • (?!(?:word1|word2|word3)\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is word1, word2 or word3 char sequences followed with a word boundary (or, if (?!\S) whitespace right-hand boundary is used, there must be a whitespace or end of string immediately to the right of the current location)
  • \w+ - 1+ word chars
  • \S+ - 1+ chars other than whitespace

In C#, as in any other programming language, you may build the pattern dynamically by joining the array/list items with a pipe character; see the demo below:

var exceptions = new[] { "and", "not", "or" };
var result = Regex.Replace("This and This not That", 
        $@"\b(?!(?:{string.Join("|", exceptions)})\b)\w+",
        "\"$&\"");
Console.WriteLine(result); // => "This" and "This" not "That"

If your "words" may contain special characters, the whitespace boundaries approach is more suitable, and make sure to escape the "words" with, say, exceptions.Select(Regex.Escape):

var pattern = $@"(?<!\S)(?!(?:{string.Join("|", exceptions.Select(Regex.Escape))})(?!\S))\S+";
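
A complete sketch of the whitespace-boundary variant follows. The exception list containing "c++", the sample input, and the class and method names are made up for illustration; "c++" is there only to show why `Regex.Escape` matters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WhitespaceBoundaryDemo
{
    // Hypothetical helper: quotes every whitespace-delimited token that is not
    // in the exception list.
    public static string QuoteTokens(string input, string[] exceptions)
    {
        // Escape each exception so characters like '+' are matched literally.
        var alternation = string.Join("|", exceptions.Select(Regex.Escape));
        var pattern = $@"(?<!\S)(?!(?:{alternation})(?!\S))\S+";
        return Regex.Replace(input, pattern, "\"$&\"");
    }

    static void Main()
    {
        var exceptions = new[] { "and", "not", "or", "c++" };
        Console.WriteLine(QuoteTokens("c++ and c# or f#", exceptions));
        // prints: c++ and "c#" or "f#"
    }
}
```

With the `\b...\w+` pattern instead, "c#" would be split at the "#", since "#" is not a word character.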

NOTE: If there are too many words to search for, it might be a better idea to build a regex trie out of them.

Wiktor Stribiżew
0
(?!\bnot\b|\band\b|\bor\b|\b\"[^"]+\"\b)((?<=\s|\-|\(|^)[^\"\s\()]+(?=\s|\*|\)|$))

I use this regex to find all words that are neither inside double quotes nor one of the words "not", "and" or "or".

bluish