As a fun exercise, I wondered whether I could tokenize simple arithmetic expressions (containing only positive integers and the four basic operations) using a single regular expression, so I came up with the following: `^(\d+)(?:([*/+-])(\d+))*$`
But the test cases below do not behave as I expected; the failures are listed after the code (Go Playground):
package main

import (
	"reflect"
	"regexp"
	"testing"
)

func TestParseCalcExpression(t *testing.T) {
	// One leading number, then zero or more (operator, number) pairs.
	re := regexp.MustCompile(`^(\d+)(?:([*/+-])(\d+))*$`)
	for _, eg := range []struct {
		input    string
		expected [][]string
	}{
		{"1", [][]string{{"1", "1", "", ""}}},
		{"1+1", [][]string{{"1+1", "1", "+", "1"}}},
		{"22/7", [][]string{{"22/7", "22", "/", "7"}}},
		{"1+2+3", [][]string{{"1+2+3", "1", "+", "2", "+", "3"}}},
		{"2*3+5/6", [][]string{{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}},
	} {
		actual := re.FindAllStringSubmatch(eg.input, -1)
		if !reflect.DeepEqual(actual, eg.expected) {
			t.Errorf("expected parse(%q)=%#v, got %#v", eg.input, eg.expected, actual)
		}
	}
}
// === RUN TestParseCalcExpression
// prog.go:24: expected parse("1+2+3")=[][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}}, got [][]string{[]string{"1+2+3", "1", "+", "3"}}
// prog.go:24: expected parse("2*3+5/6")=[][]string{[]string{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}, got [][]string{[]string{"2*3+5/6", "2", "/", "6"}}
// --- FAIL: TestParseCalcExpression (0.00s)
// FAIL
I was hoping that the "zero or more repetition" of the non-capturing group (`(?:...)*`), which wraps the operator and number groups (`([*/+-])(\d+)`), would capture every occurrence of that sub-expression, but it only appears to capture the last one.
On the one hand, this makes sense because the regex literally has only three capturing groups, so it follows that any resulting match can have at most three submatches (plus the full match). However, the "zero or more repetition" makes it seem like the result is missing all the "middle" repeated items in the failed tests (e.g. `+2` in `1+2+3`).
// expected parse("1+2+3")=
// [][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}},
// got [][]string{[]string{"1+2+3", "1", "+", "3"}}
Is there a way to parse these kinds of arithmetic expressions using Go regular expressions, or is this a fundamental limitation of regular expressions in general (or of Go/RE2 regexps specifically, or of how capturing and non-capturing groups combine with repetition)?
(I realize I could just split by word boundaries and scan the tokens to validate the structure but I'm more interested in this limitation of non/capturing groups than the example problem.)