103

In my home directory I have a folder drupal-6.14 that contains the Drupal platform.

From this directory I use the following command:

find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz

This command creates a gzipped tar archive of the folder drupal-6.14, excluding all subfolders of drupal-6.14/sites/ except sites/all and sites/default, which it includes.

My question is on the regular expression:

grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'

The expression works to exclude all the folders I want excluded, but I don't quite understand why.

A common task with regular expressions is to

match all strings except those that contain subpattern x; in other words, to negate a subpattern.

I think I understand that the general strategy for solving these problems is to use negative lookaheads, but I've never understood to a satisfactory level how positive and negative look(ahead/behind)s work.

Over the years I've read many pages on them, including the PHP and Python regex manuals and http://www.regular-expressions.info/lookaround.html, but I've never really gained a solid understanding of them.

Could someone explain how this works, and perhaps provide some examples that do similar things?

-- Update One:

Regarding Andomar's response: can a double negative lookahead be expressed more succinctly as a single positive lookahead?

That is, is:

'drupal-6.14/(?!sites(?!/all|/default)).*'

equivalent to:

'drupal-6.14/(?=sites(?:/all|/default)).*'

???

-- Update Two:

As @andomar and @alan moore point out, a double negative lookahead cannot simply be swapped for a positive lookahead.
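One way to see why the two forms differ is to run both against a path that lies outside sites altogether. The following is only a sketch: it assumes Python's `re` module handles these lookaheads the same way as the PCRE engine behind `grep -P`, and the helper name `accepts` and the sample paths are made up for illustration.

```python
import re

# Both patterns copied verbatim from the question (the unescaped "." is kept as-is).
double_negative = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")
positive = re.compile(r"drupal-6.14/(?=sites(?:/all|/default)).*")

def accepts(pattern, path):
    """True if the pattern matches somewhere in the path, like a line grep would print."""
    return pattern.search(path) is not None

# Hypothetical sample paths, chosen only to show where the patterns agree and differ.
for path in [
    "drupal-6.14/sites/all/README.txt",
    "drupal-6.14/sites/example.com/settings.php",
    "drupal-6.14/modules/node/node.module",
]:
    print(path, accepts(double_negative, path), accepts(positive, path))
```

The two patterns agree on paths under sites, but the positive form requires sites to be present, so it rejects everything else under drupal-6.14/, which the original pattern keeps.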

themesandmodules

3 Answers

181

A negative lookahead says: at this position, the following regex must not match.

Let's take a simplified example:

a(?!b(?!c))

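a      Match: no b follows, so (?!b(?!c)) succeeds
ac     Match: no b follows, so (?!b(?!c)) succeeds
ab     No match: b follows and is not followed by c, so (?!b(?!c)) fails
abe    No match: b follows and is not followed by c, so (?!b(?!c)) fails
abc    Match: b follows but is followed by c, so (?!b(?!c)) succeeds

The last example is a double negation: it allows b followed by c. The nested negative lookahead effectively acts as a positive condition: if b is present, c must follow it.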

In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.
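The table above can be checked mechanically. This is a minimal sketch in Python, assuming its `re` module treats this pattern the same way a PCRE engine does; the helper name `matched_text` is made up for illustration.

```python
import re

# Same pattern as in the simplified example above.
PATTERN = re.compile(r"a(?!b(?!c))")

def matched_text(s):
    """Return the text the pattern consumed at the start of s, or None if it fails."""
    m = PATTERN.match(s)
    return m.group(0) if m else None

for s in ["a", "ac", "ab", "abe", "abc"]:
    print(f"{s!r:6} -> {matched_text(s)!r}")
```

Where there is a match, the consumed text is just "a", which illustrates that the lookahead only tests what follows and adds nothing to the match.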

Andomar
  • If a nested negative lookahead ("double negative lookahead") can become a positive lookahead, is it possible to state an equivalent in positive lookahead form? i.e: (a) What would be the positive lookahead form of my double negative lookahead drupal "'drupal-6.14/(?!sites(?!/all|/default)).*'" example? Would it be: 'drupal-6.14/(?=sites/all|default).* ??? (b) What would be the positive lookahead form of your double negative lookahead "(!?b(?!c))" example? – themesandmodules Nov 24 '09 at 00:47
  • @willieseabrook: Don't think so, only part of the lookahead is double negative, so you can't replace the whole with a positive one – Andomar Nov 24 '09 at 06:14
  • i'd been having an issue with negative lookahead and your statement "at this position" is what clarified what i was doing wrong. thanks. – just mike Feb 23 '11 at 14:24
  • Any idea why this does not work in R? I get: Error in grep("a(?!b(?!c))", "a") Invalid regex – pssguy Mar 09 '12 at 17:41
15

Lookarounds can be nested.

So this regex matches "drupal-6.14/" that is not followed by "sites" that is not followed by "/all" or "/default".

Confusing? In other words, it matches "drupal-6.14/" that is not followed by "sites", unless that "sites" is in turn followed by "/all" or "/default".

ʞɔıu
  • Thanks for this. And *yes* I do still find it confusing LOL. I think your quote of "not followed by sites *unless* followed by all|default" is quite helpful. – themesandmodules Nov 24 '09 at 00:52
7

If you revise your regular expression like this:

drupal-6.14/(?=sites(?!/all|/default)).*
             ^^

...then it will match all inputs that contain drupal-6.14/ followed by sites followed by anything other than /all or /default. For example:

drupal-6.14/sites/foo
drupal-6.14/sites/bar
drupal-6.14/sitesfoo42
drupal-6.14/sitesall

Changing ?= to ?! to match your original regex simply negates those matches:

drupal-6.14/(?!sites(?!/all|/default)).*
             ^^

This means that drupal-6.14/ now cannot be followed by sites followed by anything other than /all or /default. So now, these inputs will satisfy the regex:

drupal-6.14/sites/all
drupal-6.14/sites/default
drupal-6.14/sites/all42

What may not be obvious from some of the other answers (and possibly your question) is that your regex also permits inputs where drupal-6.14/ is followed by anything other than sites. For example:

drupal-6.14/foo
drupal-6.14/xsites

Conclusion: your regex includes everything under drupal-6.14 except those subdirectories of sites whose names begin with anything other than all or default.
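The example inputs listed in this answer can be run through the original regex to confirm which ones survive the filter. This is a sketch only: it assumes Python's `re.search` stands in for `grep -P` (both report a match anywhere in the line), and the helper name `is_kept` is invented for illustration.

```python
import re

# Original regex from the question, copied verbatim.
REGEX = re.compile(r"drupal-6.14/(?!sites(?!/all|/default)).*")

def is_kept(path):
    """True if grep -P would print this path, i.e. it would end up in the archive."""
    return REGEX.search(path) is not None

# The inputs used as examples in this answer.
paths = [
    "drupal-6.14/sites/all",
    "drupal-6.14/sites/default",
    "drupal-6.14/sites/all42",
    "drupal-6.14/foo",
    "drupal-6.14/xsites",
    "drupal-6.14/sites/foo",
    "drupal-6.14/sites/bar",
    "drupal-6.14/sitesfoo42",
    "drupal-6.14/sitesall",
]

for p in paths:
    print(("keep " if is_kept(p) else "drop ") + p)
```

The first five paths are kept and the last four are dropped, matching the lists above.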

DavidRR