Keeping the lookahead value in regular expression

Question

Imagine I have the string abcdefghi. If I apply the regular expression

m/([a-z])([a-z])/g

to it, I get disjoint pairs ab, cd, ef, gh.

What I want is all overlapping pairs ab, bc, cd, de, ef, fg, gh, hi.

When I use a lookahead, like

m/([a-z])(?=[a-z])/g

I get the first letter of each pair a, b, c, d, e, f, g, h, but the lookahead per se is not kept.

How can I tell the regex engine that I want the first letter but also the lookahead, in order to obtain pairs of letters ab, bc, cd, de, ef, fg, gh, hi?

You also capture what's inside the lookahead, like this `([a-z])(?=([a-z]))` — Sweeper, Sep 14 '19 at 15:16
See for example https://stackoverflow.com/questions/20833295/how-can-i-match-overlapping-strings-with-regex or https://stackoverflow.com/questions/11430863/how-to-find-overlapping-matches-with-a-regexp — The fourth bird, Sep 14 '19 at 15:24

score 1 · Accepted Answer · answered Sep 14 '19 at 15:24

The () around lookaheads are non-capturing, and because lookaheads are 0-width matches, you don't get the characters that are "looked at" in the result.

You just need to make the contents of the lookahead capturing by surrounding them with a capturing group:

([a-z])(?=([a-z]))
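
As a minimal sketch of how the two groups combine into pairs, here is the same pattern run through Python's re module (the choice of Python and the variable names are assumptions for illustration; the question itself uses Perl syntax):

```python
import re

text = "abcdefghi"

# Group 1 is the consumed letter; group 2 sits inside the lookahead,
# so it is captured without being consumed.
pattern = re.compile(r"([a-z])(?=([a-z]))")

pairs = [m.group(1) + m.group(2) for m in pattern.finditer(text)]
print(pairs)  # ['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi']
```

The final i produces no match because no letter follows it, so the lookahead fails at that position.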

On a side note, there are other ways to get overlapping pairs, such as a for loop over the indices 0 through (the string's length - 2) that takes two characters at each index. You might want to consider these options as well.
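
A rough sketch of the loop-based alternative, written in Python for illustration. It assumes the input is already known to contain only lowercase letters, since a plain slice does not check [a-z] the way the regex does:

```python
text = "abcdefghi"

# range(len(text) - 1) yields indices 0 .. len(text) - 2,
# so the last slice is text[-2:], i.e. the final pair.
loop_pairs = [text[i:i + 2] for i in range(len(text) - 1)]
print(loop_pairs)  # ['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi']
```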

score 0 · Answer 2 · answered Sep 14 '19 at 18:22

You can do it by relying on the engine's BUMP ALONG feature,
using a zero-width assertion that contains a single capture group to hold
each pair.

Since the engine does not CONSUME any characters on a zero-width match, it has
a built-in mechanism to avoid an endless loop: it increments the current
position by 1 before the next attempt.

(?=([a-z]{2}))

https://regex101.com/r/GYcgiZ/1
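
For illustration, a small Python sketch of the lookahead-only pattern (Python is an assumption here; the regex101 link above is the answer's own demo). With a single capture group, re.findall returns the group contents, and the engine advances by one position after each empty match:

```python
import re

text = "abcdefghi"

# The whole pattern is a lookahead, so every match is empty;
# the pair itself lives in capture group 1.
lookahead_pairs = re.findall(r"(?=([a-z]{2}))", text)
print(lookahead_pairs)  # ['ab', 'bc', 'cd', 'de', 'ef', 'fg', 'gh', 'hi']
```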

Or,

You can do the advancing yourself by matching 2 characters in the lookahead and consuming 1.

(?=([a-z]{2})).

https://regex101.com/r/re917b/1
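
To make the difference between the two variants visible, the following Python sketch (an illustration only, using a shorter hypothetical input) prints the span of each match. Both variants capture the same pairs, but the first matches zero characters while the second consumes one:

```python
import re

text = "abcd"

# Zero-width variant: each match has an empty span, and the engine bumps along.
zero_width = [(m.group(1), m.span())
              for m in re.finditer(r"(?=([a-z]{2}))", text)]

# Consuming variant: the trailing "." eats one character per match.
consuming = [(m.group(1), m.span())
             for m in re.finditer(r"(?=([a-z]{2})).", text)]

print(zero_width)  # [('ab', (0, 0)), ('bc', (1, 1)), ('cd', (2, 2))]
print(consuming)   # [('ab', (0, 1)), ('bc', (1, 2)), ('cd', (2, 3))]
```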
