Using positive-lookahead (?=regex) with re2

Question

I'm fairly new to re2, and I'm trying to figure out how to use positive lookahead (?=regex) in Go, the way it works in JS, C++, or any PCRE-style engine.

Here are some examples of what I'm looking for.

JS:

'foo bar baz'.match(/^[\s\S]+?(?=baz|$)/);

Python:

re.match(r'^[\s\S]+?(?=baz|$)', 'foo bar baz')

Note: both examples match 'foo bar '

Thanks a lot.

Looking at https://github.com/google/re2/wiki/Syntax - there is a line saying "`(?=re)` before text matching `re` (NOT SUPPORTED)". This doesn't look good. Also, it says "alternative to backtracking regular expression engines" - suggesting they'd drop some features. — Kobi, May 18 '15 at 14:19
@Kobi there is now [dlclark/regexp2](https://github.com/dlclark/regexp2) available — Andy, Jul 24 '17 at 23:21
@Andy - Thanks! So Go has `regexp` (which is re2), and `regexp2` (which isn't re2). That is a poor choice of library names - I think this is even more confusing than Python's `re` and `regex` libraries `:P`. Looks like it was ported from .Net with [balancing groups](https://github.com/dlclark/regexp2/blob/487489b64fb796de2e55f4e8a4ad1e145f80e957/regexp_mono_test.go#L998,L1002), which are [my favorite regex feature](https://kobikobi.wordpress.com/tag/regex/) - I'll have a look. — Kobi, Jul 25 '17 at 02:16

Kobi · Accepted Answer · 2017-01-05T14:18:06.883

19

According to the Syntax Documentation, this feature isn't supported:

(?=re) before text matching re (NOT SUPPORTED)

Also, from WhyRE2:

As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.
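
As a quick check of the point above, here is a minimal sketch showing that Go's standard regexp package rejects the lookahead pattern from the question at compile time. The helper name compiles is hypothetical and used only for illustration; the exact error text is left to the runtime.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package accepts the pattern.
// (Hypothetical helper, not part of any library.)
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// The lookahead version from the question fails to compile.
	_, err := regexp.Compile(`^[\s\S]+?(?=baz|$)`)
	fmt.Println("lookahead error:", err)

	// The same pattern without the lookahead is accepted.
	fmt.Println("without lookahead compiles:", compiles(`^[\s\S]+?`))
}
```

Using regexp.Compile rather than regexp.MustCompile lets the program report the unsupported syntax as an error instead of panicking.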

edited Jan 05 '17 at 14:18

answered May 18 '15 at 14:30

Kobi


score 12 · Answer 2 · answered May 18 '15 at 14:28


You can achieve this with a simpler regexp:

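// Capture the prefix lazily; the optional non-capturing group consumes a trailing "baz".
re := regexp.MustCompile(`^(.+?)(?:baz)?$`)
sm := re.FindStringSubmatch("foo bar baz")
fmt.Printf("%q\n", sm) // sm[0] is the whole match, sm[1] the captured prefix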

sm[1] will be your match. Playground: http://play.golang.org/p/Vyah7cfBlH
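
The pattern above only strips a "baz" at the very end of the input, while the lookahead in the question stops before the first "baz" anywhere. A possible way to get closer to that behavior, sketched here under assumptions, is to let the optional group also consume everything after "baz" and to add the (?s) flag so that . matches newlines like [\s\S]. The names beforeBaz and prefixBeforeBaz are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// (?s) lets . match newlines, mirroring [\s\S] in the question.
// The optional non-capturing group consumes "baz" and everything after it,
// so group 1 stops before the first "baz", or at the end if there is none.
var beforeBaz = regexp.MustCompile(`(?s)^(.+?)(?:baz.*)?$`)

// prefixBeforeBaz is a hypothetical helper, not part of any library.
// It returns the captured prefix and whether the input matched at all.
func prefixBeforeBaz(s string) (string, bool) {
	sm := beforeBaz.FindStringSubmatch(s)
	if sm == nil {
		return "", false
	}
	return sm[1], true
}

func main() {
	for _, s := range []string{"foo bar baz", "foo baz bar", "no marker here"} {
		got, ok := prefixBeforeBaz(s)
		fmt.Printf("%q -> %q (matched: %v)\n", s, got, ok)
	}
}
```

As with the lookahead version, an empty input does not match, because the lazy group still requires at least one character.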

answered May 18 '15 at 14:28

Ainar-G


2

Yes, a capturing group is the only means to achieve that... at least until look-aheads are implemented in Go. – Wiktor Stribiżew May 18 '15 at 14:32
5

@stribizhev (wrt your "until look-aheads are implemented in Go" comment), I doubt such features will ever be added to Go or that Go will switch from using RE2. (Although you could probably use a third party PCRE package, I wouldn't recommend that). Most/all of these "features" are not supported due to the basic design which is a deliberate choice made between "advanced" (but slow/dangerous) features and speed and safety (in terms of run-time/memory). See https://swtch.com/~rsc/regexp/regexp1.html for details (or just look at the graphs). – Dave C May 18 '15 at 17:04
5

FWIW, recent research on handling lookaheads in linear time for PCRE-like engines: https://medium.com/@davisjam/using-selective-memoization-to-defeat-regular-expression-denial-of-service-f7acbbd34792 – James Davis Sep 21 '20 at 14:59

Markus Frömmel · Answer 3 · 2023-07-13T07:56:48.893

In cases where you want to match a broad pattern but exclude specific substrings purely in regex, you can use a technique called "Stepwise Exclusion".

This technique involves iteratively refining the regex to exclude specific sequences character by character.

Let's consider an example. Suppose you want to match all email addresses ending with "@google.com", but exclude the specific address "noreply@google.com". Here's how you would construct such a regex using the stepwise exclusion technique:

^(?i)([\w]{1,6}|[a-mo-z0-9_][\w]*|n[a-np-z0-9_][\w]*|no[a-qs-z0-9_][\w]*|nor[a-df-z0-9_][\w]*|nore[a-oq-z0-9_][\w]*|norep[a-km-z0-9_][\w]*|norepl[a-xz0-9_][\w]*|noreply[\w]+)@google\.com

Breakdown of the Pattern

(?i): This flag makes the regex case-insensitive.
[\w]{1,6}: This part matches any local part of one to six word characters. "noreply" has seven characters, so this covers every shorter name, including incomplete prefixes of "noreply" such as no@google.com.
[a-mo-z0-9_][\w]*: This part matches any local part that starts with a letter, digit, or underscore other than n, followed by @google.com.
Each subsequent part of the pattern (e.g., n[a-np-z0-9_][\w]*, no[a-qs-z0-9_][\w]*, etc.) matches local parts that share a longer prefix with "noreply" but diverge at the next character, so the excluded address is ruled out one character at a time.
The last part, noreply[\w]+, matches addresses that start with "noreply" and have at least one additional character before @google.com. It needs + rather than *, because noreply[\w]* would also match "noreply" itself.
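
To see how the stepwise exclusion behaves in Go, here is a minimal sketch that compiles the pattern and checks a few addresses. Two details are assumptions of this sketch rather than part of the answer: the final noreply[\w]+ alternative is written with + so that "noreply" itself stays excluded, and a trailing $ is added so that nothing may follow ".com". The names allowedSender and isAllowed are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// allowedSender is built one alternative per line for readability.
var allowedSender = regexp.MustCompile(`^(?i)(` +
	`[\w]{1,6}|` + // any local part shorter than "noreply"
	`[a-mo-z0-9_][\w]*|` + // first character is not n
	`n[a-np-z0-9_][\w]*|` + // "n", then anything but o
	`no[a-qs-z0-9_][\w]*|` +
	`nor[a-df-z0-9_][\w]*|` +
	`nore[a-oq-z0-9_][\w]*|` +
	`norep[a-km-z0-9_][\w]*|` +
	`norepl[a-xz0-9_][\w]*|` +
	`noreply[\w]+` + // assumption: "noreply" plus at least one more character
	`)@google\.com$`) // assumption: trailing $ so nothing follows ".com"

// isAllowed reports whether addr matches the stepwise exclusion pattern.
func isAllowed(addr string) bool {
	return allowedSender.MatchString(addr)
}

func main() {
	for _, addr := range []string{
		"alice@google.com",
		"no@google.com",
		"noreply2@google.com",
		"noreply@google.com",
		"NoReply@google.com",
		"alice@example.com",
	} {
		fmt.Printf("%-22s allowed: %v\n", addr, isAllowed(addr))
	}
}
```

Because \w covers only letters, digits, and underscore, local parts containing dots or hyphens are not matched by this pattern at all.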
