4

I have the following example

"Foo tells bar that bar likes potato. Bar tells foo that bar does not like potato."

I want the substring between potato and preceding occurrence of bar. So In this example, I want "bar likes potato" and also want "bar does not like potato" as the result. How can I achieve this through one regex? I know if I apply two separate regex I could get the results but I want to know if this is possible with only one regex.

Thanks, RG

  • see ridgerunner answer here: http://stackoverflow.com/questions/406230/regular-expression-to-match-text-that-doesnt-contain-a-word/15995332#15995332 – Casimir et Hippolyte May 05 '15 at 18:19

3 Answers3

3

Nice riddle. It can be solved, just not in a very nice way:

echo "Foo tells bar that bar likes potato. Bar tells foo that bar does not like potato." | \
    pcregrep  -o '\bbar\s+(?:(?:(?!bar\b)\w+)\s+)*?potato\b'

The outer (?:...) matches a word followed by a space. The inner one makes sure said word is not bar.

lcd047
  • 5,731
  • 2
  • 28
  • 38
1

Try this in Python:

>>> import re
>>> s = "Foo tells bar that bar likes potato. Bar tells foo that bar does not like potato."
>>> re.findall('bar (?:(?! bar ).)+? potato', s)
['bar likes potato', 'bar does not like potato']
suzanshakya
  • 3,524
  • 1
  • 26
  • 22
  • Worked like a charm! Could you please explain how the inner loop "(?:(?! bar ).)" works in this case? – user3669040 May 05 '15 at 18:37
  • @user3669040 `(?:...)` is a non-capturing version of regular parentheses (parentheses is used for grouping). `(?!...)` is a negative lookahead assertion, i.e., `Isaac (?!Asimov)` will match `Isaac ` only if it’s not followed by `Asimov`. – suzanshakya May 17 '15 at 09:47
0

It is possible, as the following perl snippet demonstrates:

use strict;
use warnings;

my $str
  = "Foo tells bar that bar likes potato. "
  . "Bar tells foo that bar does not like potato."
;

while ($str =~ m/( bar (?: [^b] | b[^a] | ba[^r] )*?  potato )/xmsg) {
    print STDOUT "$1\n";
}

*? is a non-greedy quantifier (Match 0 or more times, not greedily; see Quantifiers at http://perldoc.perl.org/perlre.html)

Note that the alternatives [^b] | b[^a] | ba[^r] are mutually exclusive. The book "Mastering regular expressions" (http://regex.info/) is very instructive if you want to learn more about such constructions.

Loic
  • 640
  • 4
  • 13
  • This works, but extending it to longer words gets increasingly complex. You can avoid that with a negative lookahead, as I did in my answer. The trick is to anchor the lookahead at both ends to make it effective. – lcd047 May 05 '15 at 18:18
  • It is true, your solution performs significantly better (about twice as fast). However, it does assume space-separated words (mine does not); since it is a likely assumption, I would prefer yours. It can be simplified a bit: `bar\s+(?>(?!bar\b)\w+\s+)*?potato` – Loic May 05 '15 at 18:43
  • @Loic: I will write it like this: `\bbar\W+(?:(?!bar\b)\w+\W+)*?potato\b` *(atomic group is not needed for backtracking considerations since you advance with a lazy quantifier)* – Casimir et Hippolyte May 05 '15 at 18:53
  • @Loic: Yes, for my solution it's essential to have something to use as anchors at both ends of the negative lookahead, that's the real crux of the matter. – lcd047 May 05 '15 at 19:04
  • @Casimir et Hippolyte: It is not true, non-greedy quantification does not ensure that the quantified group is handled atomically. For instance, the string `"aa"` is accepted by `(?:a*)*?a` but not by `(?>a*)*?a`; thus backtracking happens with the first regexp. In general, I prefer to avoid nested quantifications wherever possible, because they may have poor performance on pathological inputs (not always, but there is still a risk). – Loic May 05 '15 at 19:44
  • Indeed It can backtrack on spaces or non word characters when "patato" is not in the string , so writing: `\bbar\W+(?:(?!bar\b)\w+\W++)*?potato\b` solves the problem. About "nested quantifications", you indeed need to be carefull but using systematically atomic groups or possessive quantifiers, or writing patterns designed for grep: `(a|b|c)*` is a bad idea. (with a backtracking engine the last one can be terrible) You need to take all in account: how looks the string, what contains the repeated group, if it is an always true subpattern, the character classes, etc. – Casimir et Hippolyte May 05 '15 at 21:21