0

For a fun exercise I wondered if I could tokenize simple arithmetic expressions (containing only positive integers and the four basic operations) using a regular expression, so I came up with the following:

But the test cases below do not behave as I expected due to the failures listed at the end (Go Playground):

func TestParseCalcExpression(t *testing.T) {
    re := regexp.MustCompile(`^(\d+)(?:([*/+-])(\d+))*$`)

    for _, eg := range []struct {
        input    string
        expected [][]string
    }{
        {"1", [][]string{{"1", "1", "", ""}}},
        {"1+1", [][]string{{"1+1", "1", "+", "1"}}},
        {"22/7", [][]string{{"22/7", "22", "/", "7"}}},
        {"1+2+3", [][]string{{"1+2+3", "1", "+", "2", "+", "3"}}},
        {"2*3+5/6", [][]string{{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}},
    } {
        actual := re.FindAllStringSubmatch(eg.input, -1)
        if !reflect.DeepEqual(actual, eg.expected) {
            t.Errorf("expected parse(%q)=%#v, got %#v", eg.input, eg.expected, actual)
        }
    }
}

// === RUN   TestParseCalcExpression
//      prog.go:24: expected parse("1+2+3")=[][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}}, got [][]string{[]string{"1+2+3", "1", "+", "3"}}
//      prog.go:24: expected parse("2*3+5/6")=[][]string{[]string{"2*3+5/6", "2", "*", "3", "+", "5", "/", "6"}}, got [][]string{[]string{"2*3+5/6", "2", "/", "6"}}
// --- FAIL: TestParseCalcExpression (0.00s)
// FAIL

I was hoping that the "zero or more repetition" of the non-matching subgroup ((?:...)*) which identifies and groups operators and numbers (([*/+-])(\d+)) would match all occurrences of that sub-expression but it only appears to match the last one.

On the one hand, this makes sense because the regex literally has only three matching groups, so it follows that any resulting match could only have three matches. However, the "zero or more repetition" makes it seem like it's missing all the "middle" repeated items in the failed tests (e.g. +2 in 1+2+3).

// expected parse("1+2+3")=
//     [][]string{[]string{"1+2+3", "1", "+", "2", "+", "3"}},
// got [][]string{[]string{"1+2+3", "1", "+", "3"}}

Is there a way to parse these kinds of arithmetic expressions using go regular expressions or is this a fundamental limitation of regular expressions (or go/re2 regexps, or the general combination of non/capturing groups)?

(I realize I could just split by word boundaries and scan the tokens to validate the structure but I'm more interested in this limitation of non/capturing groups than the example problem.)

maerics
  • 151,642
  • 46
  • 269
  • 291

1 Answers1

1
package main

import (
    "reflect"
    "regexp"
    "testing"
)

func TestParseCalcExpression(t *testing.T) {
    re := regexp.MustCompile(`(\d+)([*/+-]?)`)

    for _, eg := range []struct {
        input    string
        expected [][]string
    }{
        {"1", [][]string{{"1", "1", ""}}},
        {"1+1", [][]string{{"1+", "1", "+"}, {"1", "1", ""}}},
        {"22/7", [][]string{{"22/", "22", "/"}, {"7", "7", ""}}},
        {"1+2+3", [][]string{{"1+", "1", "+"}, {"2+", "2", "+"}, {"3", "3", ""}}},
        {"2*3+5/6", [][]string{{"2*", "2", "*"}, {"3+", "3", "+"}, {"5/", "5", "/"}, {"6", "6", ""}}},
    } {
        actual := re.FindAllStringSubmatch(eg.input, -1)
        if !reflect.DeepEqual(actual, eg.expected) {
            t.Errorf("expected parse(%q)=%#v, got %#v", eg.input, eg.expected, actual)
        }
    }
}

Playground link

As mentioned in this question about Swift (I'm not a Swift or regex expert so I'm just guessing this applies to Go as well), you can only return one match for each matching group in your regex. It seems to just identify the last match if the group is repeating.

From the Go standard library regexp package documentation:

If 'Submatch' is present, the return value is a slice identifying the successive submatches of the expression. Submatches are matches of parenthesized subexpressions (also known as capturing groups) within the regular expression, numbered from left to right in order of opening parenthesis. Submatch 0 is the match of the entire expression, submatch 1 the match of the first parenthesized subexpression, and so on.

Given this convention, returning multiple matches per match group would break the numbering and therefore you wouldn't know which items were associated with each matching group. It seems it's possible that a regex engine could return multiple matches per group, but this package couldn't do that without breaking this convention stated in the documentation.

My solution is to make your problem more regular. Instead of treating the entire expression as one match, which gave us the problem that we can only return finitely many strings per match, we treat the entire expression as simply a series of pairs.

Each pair is composed of a number (\d+), and an optional operator ([*/+-]?).

Then doing a FindAllStringSubmatch on the whole expression, we extract a series of these pairs and get the number and operator for each.

For example: "1+2+3" returns [][]string{{"1+", "1", "+"}, {"2+", "2", "+"}, {"3", "3", ""}}}

This only tokenizes the expression; it doesn't validate it. If you need the expression to be validated, then you'll need another initial regex match to verify that the string is indeed an unbroken series of these pairs.

Hymns For Disco
  • 7,530
  • 2
  • 17
  • 33