15

I have been looking through SO and although this question has been answered in one scenario:

Regex to match all words except a given list

It's not quite what I'm looking for. I am trying to write a regular expression which matches any string of the form [\w]+[(], but which doesn't match the three strings "cat(", "dog(" and "sheep(" specifically.

I have been playing with lookahead and lookbehind, but I can't quite get there. I may be overcomplicating this, so any help would be greatly appreciated.

Huguenot

2 Answers

22

If the regular expression implementation supports look-ahead or look-behind assertions, you could use the following:

  • Using a negative look-ahead assertion:

     \b(?!(?:cat|dog|sheep)\()\w+\(
    
  • Using a negative look-behind assertion:

     \b\w+\((?<!\b(?:cat|dog|sheep)\()
    

I added the \b anchor, which marks a word boundary, so catdog( is still matched even though it contains dog(.
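As a quick check of the word-boundary point, here is a minimal sketch in Python. The language, the sample string and the variable names are assumptions for illustration and not part of the original answer. It runs the look-ahead pattern with and without \b over a few names:

```python
import re

# Names to exclude; taken from the question.
FORBIDDEN = ("cat", "dog", "sheep")
alternation = "|".join(map(re.escape, FORBIDDEN))

# Negative look-ahead anchored at a word boundary, as in the answer.
with_boundary = re.compile(rf"\b(?!(?:{alternation})\()\w+\(")

# Same pattern without \b, for comparison only.
without_boundary = re.compile(rf"(?!(?:{alternation})\()\w+\(")

# Hypothetical sample input.
sample = "cat( dog( sheep( catdog( catastrophe( foo("

print(with_boundary.findall(sample))
# ['catdog(', 'catastrophe(', 'foo(']

print(without_boundary.findall(sample))
# ['at(', 'og(', 'heep(', 'catdog(', 'catastrophe(', 'foo(']
```

Without \b the engine is free to start a match in the middle of a forbidden name, which is why fragments such as at( slip through in the second result.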

But while look-ahead assertions are more widely supported by regex implementations, the regex with the look-behind assertion is more efficient, since the assertion is only tested once the preceding pattern (in our case \b\w+\() has already matched. The look-ahead assertion, by contrast, is tested before the rest of the pattern is attempted, so in our case it is tested every time \b matches.
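How well the look-behind form is supported depends on the engine. As one illustration (assuming Python's built-in re module, which the original answer does not mention), the look-behind above is rejected because its alternatives differ in length. A possible workaround, sketched here with a hypothetical sample string, is one fixed-width look-behind per forbidden name:

```python
import re

# Python's built-in re module requires fixed-width look-behind.
# "cat"/"dog" and "sheep" differ in length, so this fails to compile.
try:
    re.compile(r"\b\w+\((?<!\b(?:cat|dog|sheep)\()")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: one fixed-width negative look-behind per forbidden name.
# Each is only evaluated after \b\w+\( has already matched.
fixed_width = re.compile(r"\b\w+\((?<!\bcat\()(?<!\bdog\()(?<!\bsheep\()")

# Hypothetical sample input.
sample = "cat( dog( sheep( catdog( foo("

print(variable_width_ok)
# False

print(fixed_width.findall(sample))
# ['catdog(', 'foo(']
```

Other engines may accept the single variable-width look-behind as written, so treat this as a sketch for one implementation, not a general rule.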

Gumbo
  • The second one is most likely more efficient since it doesn't check every single position with a negative look-ahead (it's worth noting that they're negative). Also, I'm thinking it might be better to put the negative look-behind after the parenthesis and include a parenthesis in the look-behind. This way, it will only perform an extra look-behind once it finds a possible match, rather than for every word in the string. – Blixt Jul 23 '09 at 16:25
  • Also, your first regex will reject `catastrophe(`, `dogmatic(` and `sheepily(`. Your second one is saved from a similar error by the `\b` in the look-behind. – rampion Jul 23 '09 at 16:33
  • Right, I had a go with > grep '\b(?!(?:cat|dog|sheep))\w+[(]' text.txt text.txt cat() dog() catdog() something() And it's not returning anything. I also had a go in textmate with the regular expression search but nada. I can see the logic behind the first statement though, perhaps this is a compatibility issue? I thought look-ahead was pretty standard. I've certainly been using it today in some form or another. – Huguenot Jul 23 '09 at 16:37
  • ah, no carriage returns, the second text.txt indicates the file's contents. – Huguenot Jul 23 '09 at 16:40
  • grep uses POSIX regular expressions, not PCRE, which are slightly different. I don't think the POSIX standard includes lookbehinds or lookaheads. – rampion Jul 23 '09 at 16:40
  • Thank you, it was an unfortunate coincidence that both methods weren't working properly. Textmate had copied over a carriage return into the regular expression search box. Thanks very much for the help everyone – Huguenot Jul 23 '09 at 16:45
  • I have modified the first one to "\b(?!cat\\(|dog\\(|sheep\\()\w+\\(" to prevent the problem mentioned by rampion and that seems to work. For some reason, it didn't like the word boundary in the second expression, so I changed it to a \W à la \b[A-Za-z]+\\((?<!\W(?:cat\\(|dog\\(|rat\\()) and that seems to have done the trick. Note that I had to change the third term to the same size as the others, fortunately this is not a problem. – Huguenot Jul 23 '09 at 17:32
  • The first regex just needed grouping parens around the alternation: `\b(?!(?:cat|dog|sheep)\()\w+\(` – Alan Moore Jul 23 '09 at 17:45
4

Do you really require this in a single regex? If not, then the simplest implementation is just two regexes - one to check you don't match one of your forbidden words, and one to match your \w+, chained with a logical AND.
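A minimal sketch of that two-regex idea, assuming Python and a hypothetical helper named is_allowed_call (neither is from the question):

```python
import re

# First regex: the general shape, a word followed by "(".
CANDIDATE = re.compile(r"\w+\(")

# Second regex: the forbidden names from the question.
FORBIDDEN = re.compile(r"(?:cat|dog|sheep)\(")


def is_allowed_call(token: str) -> bool:
    """Return True if token has the form name( and name is not forbidden."""
    return (
        CANDIDATE.fullmatch(token) is not None
        and FORBIDDEN.fullmatch(token) is None
    )


# Hypothetical sample tokens.
tokens = ["cat(", "dog(", "sheep(", "catdog(", "foo(", "foo"]

print([t for t in tokens if is_allowed_call(t)])
# ['catdog(', 'foo(']
```

Neither pattern uses look-around, so the same split should carry over to engines that lack look-ahead and look-behind.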

ire_and_curses