20

I'm fairly new to RE2, and I'm trying to figure out how to use a positive lookahead, (?=regex), in Go the way I would in JS, C++, or any PCRE-style engine.

Here are some examples of what I'm looking for.

JS:

'foo bar baz'.match(/^[\s\S]+?(?=baz|$)/);

Python:

re.match(r'^[\s\S]+?(?=baz|$)', 'foo bar baz')
  • Note: both examples match 'foo bar '

Thanks a lot.

a8m
  • 6
    Looking at https://github.com/google/re2/wiki/Syntax - there is a line saying "`(?=re)` before text matching `re` (NOT SUPPORTED)". This doesn't look good. Also, it says "alternative to backtracking regular expression engines" - suggesting they'd drop some features. – Kobi May 18 '15 at 14:19
  • I guess that's a sort of an answer, so I've added one. – Kobi May 18 '15 at 14:32
  • 1
    @Kobi there is now [dlclark/regexp2](https://github.com/dlclark/regexp2) available – Andy Jul 24 '17 at 23:21
  • 3
    @Andy - Thanks! So Go has `regexp` (which is re2), and `regexp2` (which isn't re2). That is a poor choice of library names - I think this is even more confusing than Python's `re` and `regex` libraries `:P`. Looks like it was ported from .Net with [balancing groups](https://github.com/dlclark/regexp2/blob/487489b64fb796de2e55f4e8a4ad1e145f80e957/regexp_mono_test.go#L998,L1002), which are [my favorite regex feature](https://kobikobi.wordpress.com/tag/regex/) - I'll have a look. – Kobi Jul 25 '17 at 02:16

3 Answers

19

According to the Syntax Documentation, this feature isn't supported:

(?=re) before text matching re (NOT SUPPORTED)

Also, from WhyRE2:

As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.
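
For what it's worth, here is a minimal sketch of how this shows up in practice with Go's standard `regexp` package: compiling the pattern from the question returns an error instead of a usable regexp. The helper name `compiles` is made up for illustration, and the exact error text may differ between Go versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper: it reports whether Go's standard
// regexp package accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// The pattern from the question, including the lookahead.
	_, err := regexp.Compile(`^[\s\S]+?(?=baz|$)`)
	fmt.Println("with lookahead, error:", err)

	// The same pattern without the lookahead compiles fine.
	fmt.Println("without lookahead, compiles:", compiles(`^[\s\S]+?`))
}
```

Note that `regexp.MustCompile` would panic on the first pattern, so `regexp.Compile` is the safer call when the pattern comes from somewhere else.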

Kobi
12

You can achieve this with a simpler regexp:

re := regexp.MustCompile(`^(.+?)(?:baz)?$`)
sm := re.FindStringSubmatch("foo bar baz")
fmt.Printf("%q\n", sm)

sm[1] will be your match. Playground: http://play.golang.org/p/Vyah7cfBlH
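
A self-contained sketch of the same idea, in case the playground link goes stale. `answerRe` is the pattern from this answer, unchanged. `anywhereRe` is a variant I am assuming is closer to the original lookahead: it uses `[\s\S]` so newlines are covered, and it stops before the first "baz" anywhere in the input rather than only a trailing one. The helper `firstGroup` is a made-up name for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// answerRe is the pattern from the answer: an optional trailing "baz"
// is kept out of capture group 1.
var answerRe = regexp.MustCompile(`^(.+?)(?:baz)?$`)

// anywhereRe is an assumed variant: group 1 ends before the first "baz"
// anywhere in the input, or at the end of the input if there is none.
var anywhereRe = regexp.MustCompile(`^([\s\S]+?)(?:baz|$)`)

// firstGroup returns capture group 1 of re applied to s,
// or the empty string if re does not match.
func firstGroup(re *regexp.Regexp, s string) string {
	sm := re.FindStringSubmatch(s)
	if sm == nil {
		return ""
	}
	return sm[1]
}

func main() {
	for _, s := range []string{"foo bar baz", "foo baz bar baz"} {
		fmt.Printf("%q: answer=%q variant=%q\n",
			s, firstGroup(answerRe, s), firstGroup(anywhereRe, s))
	}
}
```

The two patterns agree on the input from the question and differ only when "baz" also appears earlier in the string, so which one fits depends on what the lookahead was meant to do.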

Ainar-G
  • 2
    Yes, a capturing group is the only means to achieve that... at least, until look-aheads are implemented in Go. – Wiktor Stribiżew May 18 '15 at 14:32
  • 5
    @stribizhev (wrt your "until look-aheads are implemented in Go" comment), I doubt such features will ever be added to Go or that Go will switch from using RE2. (Although you could probably use a third party PCRE package, I wouldn't recommend that). Most/all of these "features" are not supported due to the basic design which is a deliberate choice made between "advanced" (but slow/dangerous) features and speed and safety (in terms of run-time/memory). See https://swtch.com/~rsc/regexp/regexp1.html for details (or just look at the graphs). – Dave C May 18 '15 at 17:04
  • 5
    FWIW, recent research on handling lookaheads in linear time for PCRE-like engines: https://medium.com/@davisjam/using-selective-memoization-to-defeat-regular-expression-denial-of-service-f7acbbd34792 – James Davis Sep 21 '20 at 14:59
0

If you want to match a broad pattern but exclude specific strings using nothing but the regex itself, you can use a technique called "Stepwise Exclusion".

This technique involves iteratively refining the regex to exclude specific sequences character by character.

Let's consider an example. Suppose you want to match all email addresses ending with "@google.com", but exclude the specific address "noreply@google.com". Here's how you would construct such a regex using the stepwise exclusion technique:

^(?i)([\w]{1,6}|[a-mo-z0-9_][\w]*|n[a-np-z0-9_][\w]*|no[a-qs-z0-9_][\w]*|nor[a-df-z0-9_][\w]*|nore[a-oq-z0-9_][\w]*|norep[a-km-z0-9_][\w]*|norepl[a-xz0-9_][\w]*|noreply[\w]+)@google\.com

Breakdown of the Pattern

  1. (?i): This flag makes the regex case insensitive.
  2. [\w]{1,6}: Matches any local part of one to six word characters. "noreply" has seven characters, so anything shorter, such as no@google.com, is safe to accept.
  3. [a-mo-z0-9_][\w]*: Matches any local part whose first character is a letter other than n, a digit, or an underscore.
  4. Each subsequent part of the pattern (e.g., n[a-np-z0-9_][\w]*, no[a-qs-z0-9_][\w]*, etc.) matches a local part that shares a prefix with "noreply" but diverges at the next character, so the exclusion advances one character at a time.
  5. The last part, noreply[\w]+, matches addresses that start with "noreply" and have at least one additional character before @google.com. It has to be + rather than *, otherwise noreply@google.com itself would match.
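
A quick way to sanity-check a pattern like this is to run it over a handful of addresses in Go. This is only a sketch under a few assumptions: the pattern below includes the final `noreply[\w]+` alternative that item 5 describes, I have added a trailing `$` so that nothing can follow `@google.com`, and `isAllowed` is a made-up helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// allowed is the stepwise-exclusion pattern. Assumptions: the final
// noreply[\w]+ alternative is included, and a trailing $ has been added.
var allowed = regexp.MustCompile(`^(?i)(` +
	`[\w]{1,6}` + // local parts shorter than "noreply"
	`|[a-mo-z0-9_][\w]*` + // first character is not n
	`|n[a-np-z0-9_][\w]*` + // "n", then not o
	`|no[a-qs-z0-9_][\w]*` + // "no", then not r
	`|nor[a-df-z0-9_][\w]*` + // "nor", then not e
	`|nore[a-oq-z0-9_][\w]*` + // "nore", then not p
	`|norep[a-km-z0-9_][\w]*` + // "norep", then not l
	`|norepl[a-xz0-9_][\w]*` + // "norepl", then not y
	`|noreply[\w]+` + // "noreply" plus at least one more character
	`)@google\.com$`)

// isAllowed is a hypothetical helper: it reports whether addr is an
// @google.com address other than noreply@google.com.
func isAllowed(addr string) bool {
	return allowed.MatchString(addr)
}

func main() {
	for _, addr := range []string{
		"alice@google.com",
		"no@google.com",
		"notifications@google.com",
		"noreply2@google.com",
		"noreply@google.com",
		"NoReply@google.com",
	} {
		fmt.Printf("%-26s allowed=%v\n", addr, isAllowed(addr))
	}
}
```

Keep in mind that `\w` does not cover dots or hyphens, so local parts such as first.last would need a wider character class in every alternative.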