(Note: this is not a duplicate of "Why can't you use repetition quantifiers in zero-width look behind assertions"; see the end of the post.)

I'm trying to write a grep -P (Perl) regex that matches B, when it is not preceded by A -- regardless of whether there is intervening whitespace.

So, I tried this negative lookbehind, and tested it in regex101.com:

(?<!A)\s*B

This causes "AB" not to be matched, which is good, but "A B" does result in a match, which is not what I want.

I am not exactly sure why this is. It has something to do with the fact that \s* matches the empty string "", so you could say that there are infinitely many matches of \s* between A and B. But why does this affect "A B" but not "AB"?

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

I posted this before and it was incorrectly marked as a duplicate question. The variable-length thing I'm looking for is part of the match, not part of the negative lookbehind itself -- so this is quite different from the other question. Yes, I could put the \s* inside the negative lookbehind, but I haven't done so (and doing so is not supported, as the other question explains). Also, I am particularly interested in why the alternate regex I posted above works, since I know it works but I'm not exactly sure why. The other question did not help answer that.

brian d foy
std_answ
  • You could also use `/[^A\s]\s*B/` – Håkon Hægland Mar 29 '17 at 22:19
  • Good point. In my real use case though, A and B are both words rather than just characters. – std_answ Mar 29 '17 at 22:21
  • `(?<![A\s])\s*B` Not really a good way to do this. One reason is the humungous backtracking going on. Maybe there will come a day when you care more about performance than substance. Since you're using Perl, leverage its verbs. `(?:A\s*B(*SKIP)(*FAIL)|B)` –  Mar 30 '17 at 01:10
  • Comparison `Regex1: (?<![A\s])\s*B Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 1 Elapsed Time: 0.53 s, 530.18 ms, 530185 µs Regex2: (?:A\s*B(*SKIP)(*FAIL)|B) Completed iterations: 50 / 50 ( x 1000 ) Matches found per iteration: 1 Elapsed Time: 0.18 s, 180.07 ms, 180073 µs` –  Mar 30 '17 at 01:16
  • @sln: the special verbs are particularly useful since they can be used when A and B are whole words rather than just characters. – std_answ Mar 30 '17 at 14:21
  • @ikegami: Good catch, since without the `^|` option, `[^A\s]\s*B` fails to match either "B" or " B". – std_answ Mar 30 '17 at 14:24
  • @wdep1, I had failed to notice the first two comments. Deleting. – ikegami Mar 30 '17 at 14:34
  • My bad for not specifying that they were whole words to begin with. Still useful info to have. For reference, ikegami noted that if A and B *are* just characters and not words, `[^A\s]\s*B` has problems as described, which `(?:^|[^A\s])\s*B` fixes. – std_answ Mar 30 '17 at 14:45

1 Answer

But why does this affect "A B" but not "AB"?

Regexes match at a position, which it is helpful to think of as being between characters. In "A B" there is a position (after the space and before the B) where (?<!A) succeeds (because there isn't an A immediately preceding; there's a space instead), and \s*B succeeds (\s* matches the empty string, and B matches B), so the entire pattern succeeds.

In "AB" there is no such position. The only place where \s*B can match (immediately before the B), is also immediately after the A, so (?<!A) cannot succeed. There are no positions that satisfy both, so the pattern as a whole can't succeed.
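To see the positions concretely, here is a small sketch. It uses Python's re module as a stand-in for grep -P, on the assumption that the two behave the same for a lookbehind this simple; the helper name first_match_original is made up for illustration.

```python
import re

# The original pattern from the question.
ORIGINAL = re.compile(r"(?<!A)\s*B")

def first_match_original(text):
    """Return (start position, matched text) of the first match, or None."""
    m = ORIGINAL.search(text)
    return (m.start(), m.group()) if m else None

# "AB":  the only position where \s*B can match is right after the A,
#        so the lookbehind fails there and there is no match.
# "A B": at position 2 (after the space, before the B) the preceding
#        character is a space, not an A, and \s* matches the empty string.
for text in ["AB", "A B", "C B"]:
    print(repr(text), "->", first_match_original(text))
```

Note that for "A B" the match is just "B" starting at position 2: the engine never needed \s* to consume the space.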

Is the following regex a proper solution, and if so, why exactly does it fix the problem?

(?<![A\s])\s*B

This works because (?<![A\s]) will not succeed immediately after an A or after a space. So now the lookbehind forbids any match position that has spaces before it. If there are any spaces before the B, they have to be consumed by the \s* portion of the pattern, and the match position must be before them. If that position also doesn't have an A before it, the lookbehind can succeed and the pattern as a whole can match.
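The same kind of check for the altered pattern, again sketched in Python's re as a stand-in for grep -P (an assumption; the helper name first_match_fixed is hypothetical):

```python
import re

# The altered pattern: the lookbehind now also rejects positions
# that come directly after whitespace.
FIXED = re.compile(r"(?<![A\s])\s*B")

def first_match_fixed(text):
    """Return (start position, matched text) of the first match, or None."""
    m = FIXED.search(text)
    return (m.start(), m.group()) if m else None

# In "A B" and "A   B" every position before the B is preceded by either
# the A or a space, so no start position survives the lookbehind.
# In "C B" the match has to start right after the C, and \s* consumes
# the space.
for text in ["AB", "A B", "A   B", "C B", " B", "B"]:
    print(repr(text), "->", first_match_fixed(text))
```

The " B" case still matches because at position 0 there is nothing before the match position at all, so the negative lookbehind succeeds.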

This is a trick that's made possible by the fact that \s is a fixed-width pattern that matches at every position inside of a non-empty \s* match. It can't be extended to the general case of any pattern between the (non-)A and the B.
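For cases where the trick doesn't extend, one possible workaround is to match the unwanted "A, separator, B" sequence first and throw it away, which is the idea behind the (*SKIP)(*FAIL) pattern suggested in the comments. Python's re has no such verbs, so the sketch below swaps in a different technique: an alternation where only the wanted branch has a capture group. The function name find_unpreceded and the words foo and bar are placeholders, not anything from the question.

```python
import re

def find_unpreceded(text, a=r"foo", sep=r"\s*", b=r"bar"):
    """Return start positions of b that are not preceded by a + sep.

    The first alternative consumes the unwanted "a sep b" sequence
    without capturing; the second captures a standalone b in group 1.
    """
    pattern = re.compile(f"(?:{a}){sep}(?:{b})|({b})")
    return [m.start(1) for m in pattern.finditer(text)
            if m.group(1) is not None]

print(find_unpreceded("foo bar"))            # the only bar follows foo
print(find_unpreceded("foo   bar baz bar"))  # the second bar is kept
```

Because the separator is matched as part of the pattern rather than inside a lookbehind, it doesn't need to be fixed-width, and a and b can be whole words.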

Community
hobbs
  • Makes sense, thanks! Re: your first point: It took me a minute to realize that "so (?<!A) cannot match" means that the negative lookbehind recognizes A successfully, and says that the pattern match against the whole string fails as a result of the negative lookbehind taking effect. – std_answ Mar 29 '17 at 23:05
  • To summarize, for anyone who's reading this and confused: For the original regex, "A B" is a tricky case because there's a potential match position before B where the \s* acts like an empty string and there's a preceding space, rather than a preceding A, so the negative lookbehind doesn't forbid a match. To fix this, the altered regex makes sure that only match positions not directly after spaces can be considered. – std_answ Mar 29 '17 at 23:05
  • @wdep1 valid point! I changed "match" to "succeed" which is hopefully clearer (a negative lookaround succeeds by not matching anything). – hobbs Mar 30 '17 at 03:38