-1

Here is a simple regular expression:

package main

import (
    "fmt"
    "regexp"
)

const data = "abcdefghijklmn"

func main() {
    r, err := regexp.Compile(".{1,6}")
    if err != nil {
        panic(err)
    }
    for _, d := range r.FindAllIndex([]byte(data), -1) {
        fmt.Println(data[d[0]:d[1]])
    }
}

And we know it is greedy:

abcdef
ghijkl
mn

Now, we can add a ? after the expression to make it non greedy:

package main

import (
    "fmt"
    "regexp"
)

const data = "abcdefghijklmn"

func main() {
    r, err := regexp.Compile(".{1,6}?")
    if err != nil {
        panic(err)
    }
    for _, d := range r.FindAllIndex([]byte(data), -1) {
        fmt.Println(data[d[0]:d[1]])
    }
}

And we can get:

a
b
c
d
e
f
g
h
i
j
k
l
m
n

However, if we add other chars after the expression, it becomes greedy:

package main

import (
    "fmt"
    "regexp"
)

const data = "abcdefghijklmn"

func main() {
    r, err := regexp.Compile(".{1,6}?k")
    if err != nil {
        panic(err)
    }
    for _, d := range r.FindAllIndex([]byte(data), -1) {
        fmt.Println(data[d[0]:d[1]])
    }
}

And we get:

efghijk

So why it becomes greedy if we add a char after it?

grt1st
  • 11
  • 1
  • 4
  • And is there a way to make it still non greedy after we add char after the expression? – grt1st Jun 07 '20 at 07:34
  • I don't understand what you want. Why would you expect `.{1,6}?k` to _not_ match `efghijk`? – Jonathan Hall Jun 07 '20 at 08:00
  • Left-most matching trumps greediness. This works the same in all regex engines, not just RE2. Use `.k` to match just "jk". – Peter Jun 07 '20 at 08:10
  • `.{1,6}?k` is still **lazy**, period. Just the string is parsed from left to right, and all matches start at the leftmost matching positions. – Wiktor Stribiżew Jun 07 '20 at 10:07

1 Answers1

2

Adding a lazy quantifier after a repetition count changes it from matching as many as possible, to as few as possible.

However, this does not change the fact that the string must be processed serially. This is where your two cases differ:

  • .{1,6}? returns one character at a time because this is the fewest matches as the string is being processed. The lazy quantifier lets the engine match after a single character, not needing to keep processing the string.
  • .{1,6}?k has to skip over abcd to get a match, but it then finds the substring starting at e to be a match. A lazy quantifier does not let the engine move to the next character in the string.

In short: matching from the current position takes precedence over moving to the next position in the hope of a smaller match.

As for your question about making it lazy again, you can't. You'll have to find a different regular expression for the output you want.

Marc
  • 19,394
  • 6
  • 47
  • 51