How to match a regex with backreference in Go?

Question

I need to match a regex that uses backreferences (e.g. \1) in my Go code.

That's not so easy because in Go, the official regexp package uses the RE2 engine, one that have chosen to not support backreferences (and some other lesser-known features) so that there can be a guarantee of linear-time execution, therefore avoiding regex denial-of-service attacks. Enabling backreferences support is not an option with RE2.

In my code, there is no risk of malicious exploitation by attackers, and I need backreferences.

What should I do?

are you sure you *need* to have this fancy regex? what are you trying to solve? — fabrizioM, May 31 '14 at 16:58

score 13 · Answer 1 · answered May 31 '14 at 10:33

13

Answering my own question here, I solved this using golang-pkg-pcre, it uses libpcre++, perl regexes that do support backreferences. The API is not the same.

answered May 31 '14 at 10:33

Eldritch Conundrum

8,452
6
42
50

1

This is great, because I need bug-for-bug compatibility with an existing system. – Riking Feb 08 '18 at 18:29
Would you please post an example? I found that package, couldn't get backreferencing to work, then found an issue on its tracker indicating that others couldn't either. – ESR Jan 01 '19 at 00:33
@ESR I found my old code from 2014, it still works. Even your sample code runs fine. https://gist.github.com/eldritchconundrum/790680bec2a4ce34b6d7a346d239e3a5 – Eldritch Conundrum Jan 01 '19 at 09:52
@esr oh wait, you meant backreferences in the replacement string... No, they don't appear to work. I only needed backreferences in the pattern string, so I didn't notice. – Eldritch Conundrum Jan 01 '19 at 18:44
for those curious, the bug @ESR was referring to is: https://github.com/glenn-brown/golang-pkg-pcre/issues/8 – brent saner Apr 22 '20 at 22:22

score 9 · Answer 2 · answered May 31 '14 at 10:48

9

Regular Expressions are great for working with regular grammars, but if your grammar isn't regular (i.e. requires back-references and stuff like that) you should probably switch to a better tool. There are a lot of good tools available for parsing context-free grammars, including yacc which is shipped with the Go distribution by default. Alternatively, you can also write your own parser. Recursive descent parsers can be easily written by hand for example.

I think regular expressions are overused in scripting languages (like Perl, Python, Ruby, ...) because their C/ASM powered implementation is usually more optimized than those languages itself, but Go isn't such a language. Regular expressions are usually quite slow and are often not suited for the problem at all.

answered May 31 '14 at 10:48

tux21b

90,183
16
117
101

Regexps have legitimate uses even for non-regular grammars. I agree that regexps for parsing is often a poor choice, but not all regexes are used for parsing, e.g. regex search in your favorite text editor. In my case, I'd have no use for a custom parser, I really need a regex, it's the spec. – Eldritch Conundrum May 31 '14 at 11:26
6

regular expressions don't match non-regular grammars. – thwd May 31 '14 at 12:30
4

@tomwilde - Sure they do. The engines in modern regex tools haven't been "[REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40 "See this comment written by the author of the Perl camel book")" for some decades now. – ridgerunner May 31 '14 at 12:38
4

@ridgerunner That's why they are called *extended* regular expressions. – fuz May 31 '14 at 12:45
@fuz not by anyone who knows what they're talking about. That causes confusion with POSIX ERE (which, ironically, is not very "extended" and doesn't even support backreference). – hobbs Oct 10 '18 at 16:48

score 3 · Answer 3 · answered Nov 10 '15 at 22:52

When I had the same problem, I solved it using a two-step regular expression match. The original code is:

if m := match(pkgname, `^(.*)\$\{DISTNAME:S(.)(\\^?)([^:]*)(\\$?)\2([^:]*)\2(g?)\}(.*)$`); m != nil {
    before, _, left, from, right, to, mod, after := m[1], m[2], m[3], m[4], m[5], m[6], m[7], m[8]
    // ...
}

The code is supposed to parse a string of the form ${DISTNAME:S|from|to|g}, which itself is a little pattern language using the familiar substitution syntax S|replace|with|.

The two-stage code looks like this:

if m, before, sep, subst, after := match4(pkgname, `^(.*)\$\{DISTNAME:S(.)([^\\}:]+)\}(.*)$`); m {
    qsep := regexp.QuoteMeta(sep)
    if m, left, from, right, to, mod := match5(subst, `^(\^?)([^:]*)(\$?)`+qsep+`([^:]*)`+qsep+`(g?)$`); m {
        // ...
    }
}

The match, match4 and match5 are my own wrapper around the regexp package, and they cache the compiled regular expressions so that at least the compilation time is not wasted.

score 1 · Answer 4 · edited Oct 19 '21 at 23:42

regexp package funcs FindSubmatchIndex and Expand can capture content by backreferences. It isn't very convenient, but it is still possible. Example

package main

import (
    "fmt"
    "regexp"
)

func main() {
    content := []byte(`
    # comment line
    option1: value1
    option2: value2

    # another comment line
    option3: value3
`)

    pattern := regexp.MustCompile(`(?m)(?P<key>\w+):\s+(?P<value>\w+)$`)

    template := []byte("$key=$value\n")
    result := []byte{}
    for _, submatches := range pattern.FindAllSubmatchIndex(content, -1) {
        result = pattern.Expand(result, template, content, submatches)
    }
    fmt.Println(string(result))
}

output

option1=value1
option2=value2
option3=value3

The question is about something else, but I was looking for this answer. — Ярослав Рахматуллин, Oct 19 '21 at 23:43

eliranwong · Answer 5 · 2022-12-10T17:00:29.773

I think this was an old question, but I haven't found a simple solution from answers above.

In addition, "golang-pkg-pcre" does not work on macOS with M1.

Therefore, I would like to contribute my idea.

For example, to replace or with and replace or with . The search is case-insensitive.

Let me compare how to do it in python and in go

In python, it is easy as below:

import re
content = "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
content = re.sub(r"<(u|i)>([^<>]+?)</\1>", r"<b>\2</b>", content, flags=re.IGNORECASE)
print(content)

In go, I do it this way:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    content := "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
    content = changeUITagToBTag(content)
    fmt.Println(content)
}

// change <u> or <i> to <b> and </u> or </i> to </b>
// case-insensitive search
func changeUITagToBTag(content string) string {
    pattern := `<(u|i)>([^<>]+?)</(u|i)>`
    compiledPattern := regexp.MustCompile(fmt.Sprintf(`(?%v)%v`, "i", pattern))
    content = compiledPattern.ReplaceAllStringFunc(content, func(text string) string {
        allSubStrings := compiledPattern.FindAllStringSubmatch(text, -1)
        if allSubStrings[0][1] == allSubStrings[0][3] {
            return fmt.Sprintf(`<b>%s</b>`, allSubStrings[0][2])
        }
        return text
    })
    return content
}

How to match a regex with backreference in Go?

5 Answers5

Linked

Related