How to grep/perl/awk overlapping regex

Question

I'm trying to pipe a string into a grep/perl regex to pull out overlapping matches. Currently, the results contain only sequential, non-overlapping matches, without any "lookback":

Attempt using egrep (both on GNU and BSD):

$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim

Attempt using Perl-style grep (-P):

$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim

Attempt using awk showing only the first match:

$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary

The overlapping results I'd like to see from a simple working bash pipe command are:

bob mary
mary mike
mike bill
bill kim
kim jim
jim john

Any ideas?

Check out "lookahead assertions." You can combine these with global (/g) option and a capture group inside the assertion to retrieve the matches. — Gene, Oct 06 '21 at 03:28

zdim · Accepted Answer · 2021-10-07T17:22:16.930

Lookahead is your friend here

echo "bob mary mike bill kim jim john" | 
    perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'

The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.

So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.

It is nice, though, that we can also capture the pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.

Then, because of the /g modifier, the engine moves on to find another word+spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and the one after it is "looked" for (and captured). Etc.

See Lookahead and lookbehind assertions in perlretut

score 2 · Answer 2 · answered Oct 06 '21 at 17:32

Use the Perl one-liners below, which avoid the lookahead (which can still be your friend):
For whitespace-delimited words:

echo "bob mary mike bill kim jim john" | perl -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'

For words defined as \w+ in Perl, delimited by the non-word characters \W+:

echo "bob.mary,mike'bill kim jim john" | perl -F'/\W+/' -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\W+/' : Split into @F on \W+ (one or more non-word characters), rather than on whitespace.

$#F : the last index of the array @F, into which the input line is split.
0..($#F-1) : the range of indexes (numbers), from the first (0) to the penultimate ($#F-1) index of the array @F.
$F[$_] and $F[$_+1]: two consecutive elements of the array @F, with indexes $_ and $_+1, respectively.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

Wiktor Stribiżew · Answer 3 · 2021-10-07T10:45:55.503

You can also use awk

awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<< 'bob mary mike bill kim jim john'

See the online demo. This solution iterates over the whitespace-separated fields up to the second-to-last one and prints the current field ($i), the output field separator (a space by default), and the next field ($(i+1)).
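
If the words are separated by punctuation rather than whitespace, the same loop can be combined with a custom field separator. This is only a sketch: the function name awk_pairs and the bracket expression (an ASCII approximation of Perl's \W+) are assumptions, not part of the answer:

```shell
# Hypothetical variant: split fields on any run of non-word characters.
awk_pairs() {
    # A multi-character -F value is treated as a regular expression.
    awk -F'[^A-Za-z0-9_]+' '{ for (i = 1; i < NF; i++) print $i, $(i+1) }'
}

printf '%s\n' "bob.mary,mike'bill kim" | awk_pairs
```

Note that a line starting with a separator character would produce an empty first field, so such input would need trimming first.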

Or, another perl solution that uses a very common technique to capture the overlapping pattern inside a positive lookahead:

perl -lane 'while (/(?=\b(\p{L}+\s+\p{L}+))/g) {print $1}' <<< 'bob mary mike bill kim jim john'

See the online demo. Details:

- (?= - start of a positive lookahead
- \b - a word boundary
- (\p{L}+\s+\p{L}+) - capturing group 1: one or more letters, one or more whitespace characters, one or more letters
- ) - end of the lookahead.

Here, only Group 1 values are printed ({print $1}).

Performance consideration

Among the Perl solutions here, mine turns out to be the slowest and Timur's the fastest. The awk solution, however, is faster than any of the Perl solutions. Results:

# ./wiktor_awk.sh

real    0m17.069s
user    0m12.264s
sys     0m5.314s

# ./timur_perl.sh

real    0m18.201s
user    0m15.612s
sys     0m6.139s

# ./zdim.sh

real    0m23.559s
user    0m19.883s
sys     0m7.359s

# ./wiktor_perl.sh

real    2m12.528s
user    1m52.857s
sys     0m20.201s

Note: I created a *.sh file for each solution, like this:

#!/bin/bash
N=10000
time (
  for i in $(seq 1 "$N"); do
    <SOLUTION_HERE> &>/dev/null
  done
)

and made the files executable with for f in *.sh; do chmod +x "$f"; done (borrowed from here).
