15

I need to match a regex that uses backreferences (e.g. \1) in my Go code.

That's not so easy because in Go, the official regexp package uses the RE2 engine, one that have chosen to not support backreferences (and some other lesser-known features) so that there can be a guarantee of linear-time execution, therefore avoiding regex denial-of-service attacks. Enabling backreferences support is not an option with RE2.

In my code, there is no risk of malicious exploitation by attackers, and I need backreferences.

What should I do?

Eldritch Conundrum
  • 8,452
  • 6
  • 42
  • 50

5 Answers5

13

Answering my own question here, I solved this using golang-pkg-pcre, it uses libpcre++, perl regexes that do support backreferences. The API is not the same.

Eldritch Conundrum
  • 8,452
  • 6
  • 42
  • 50
  • 1
    This is great, because I need bug-for-bug compatibility with an existing system. – Riking Feb 08 '18 at 18:29
  • Would you please post an example? I found that package, couldn't get backreferencing to work, then found an issue on its tracker indicating that others couldn't either. – ESR Jan 01 '19 at 00:33
  • @ESR I found my old code from 2014, it still works. Even your sample code runs fine. https://gist.github.com/eldritchconundrum/790680bec2a4ce34b6d7a346d239e3a5 – Eldritch Conundrum Jan 01 '19 at 09:52
  • @esr oh wait, you meant backreferences in the replacement string... No, they don't appear to work. I only needed backreferences in the pattern string, so I didn't notice. – Eldritch Conundrum Jan 01 '19 at 18:44
  • for those curious, the bug @ESR was referring to is: https://github.com/glenn-brown/golang-pkg-pcre/issues/8 – brent saner Apr 22 '20 at 22:22
9

Regular Expressions are great for working with regular grammars, but if your grammar isn't regular (i.e. requires back-references and stuff like that) you should probably switch to a better tool. There are a lot of good tools available for parsing context-free grammars, including yacc which is shipped with the Go distribution by default. Alternatively, you can also write your own parser. Recursive descent parsers can be easily written by hand for example.

I think regular expressions are overused in scripting languages (like Perl, Python, Ruby, ...) because their C/ASM powered implementation is usually more optimized than those languages itself, but Go isn't such a language. Regular expressions are usually quite slow and are often not suited for the problem at all.

tux21b
  • 90,183
  • 16
  • 117
  • 101
  • Regexps have legitimate uses even for non-regular grammars. I agree that regexps for parsing is often a poor choice, but not all regexes are used for parsing, e.g. regex search in your favorite text editor. In my case, I'd have no use for a custom parser, I really need a regex, it's the spec. – Eldritch Conundrum May 31 '14 at 11:26
  • 6
    regular expressions don't match non-regular grammars. – thwd May 31 '14 at 12:30
  • 4
    @tomwilde - Sure they do. The engines in modern regex tools haven't been "[REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40 "See this comment written by the author of the Perl camel book")" for some decades now. – ridgerunner May 31 '14 at 12:38
  • 4
    @ridgerunner That's why they are called *extended* regular expressions. – fuz May 31 '14 at 12:45
  • @fuz not by anyone who knows what they're talking about. That causes confusion with POSIX ERE (which, ironically, is not very "extended" and doesn't even support backreference). – hobbs Oct 10 '18 at 16:48
3

When I had the same problem, I solved it using a two-step regular expression match. The original code is:

if m := match(pkgname, `^(.*)\$\{DISTNAME:S(.)(\\^?)([^:]*)(\\$?)\2([^:]*)\2(g?)\}(.*)$`); m != nil {
    before, _, left, from, right, to, mod, after := m[1], m[2], m[3], m[4], m[5], m[6], m[7], m[8]
    // ...
}

The code is supposed to parse a string of the form ${DISTNAME:S|from|to|g}, which itself is a little pattern language using the familiar substitution syntax S|replace|with|.

The two-stage code looks like this:

if m, before, sep, subst, after := match4(pkgname, `^(.*)\$\{DISTNAME:S(.)([^\\}:]+)\}(.*)$`); m {
    qsep := regexp.QuoteMeta(sep)
    if m, left, from, right, to, mod := match5(subst, `^(\^?)([^:]*)(\$?)`+qsep+`([^:]*)`+qsep+`(g?)$`); m {
        // ...
    }
}

The match, match4 and match5 are my own wrapper around the regexp package, and they cache the compiled regular expressions so that at least the compilation time is not wasted.

Roland Illig
  • 40,703
  • 10
  • 88
  • 121
1

regexp package funcs FindSubmatchIndex and Expand can capture content by backreferences. It isn't very convenient, but it is still possible. Example

package main

import (
    "fmt"
    "regexp"
)

func main() {
    content := []byte(`
    # comment line
    option1: value1
    option2: value2

    # another comment line
    option3: value3
`)

    pattern := regexp.MustCompile(`(?m)(?P<key>\w+):\s+(?P<value>\w+)$`)

    template := []byte("$key=$value\n")
    result := []byte{}
    for _, submatches := range pattern.FindAllSubmatchIndex(content, -1) {
        result = pattern.Expand(result, template, content, submatches)
    }
    fmt.Println(string(result))
}

output

option1=value1
option2=value2
option3=value3

Vladimir Filin
  • 125
  • 1
  • 7
1

I think this was an old question, but I haven't found a simple solution from answers above.

In addition, "golang-pkg-pcre" does not work on macOS with M1.

Therefore, I would like to contribute my idea.

For example, to replace <u> or <I> with <b> and replace </u> or </I> with </b>. The search is case-insensitive.

Let me compare how to do it in python and in go

In python, it is easy as below:

import re
content = "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
content = re.sub(r"<(u|i)>([^<>]+?)</\1>", r"<b>\2</b>", content, flags=re.IGNORECASE)
print(content)

In go, I do it this way:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    content := "<u>test1</u> <i>test2</i>\n<U>test3</U> <I>test4</I>"
    content = changeUITagToBTag(content)
    fmt.Println(content)
}

// change <u> or <i> to <b> and </u> or </i> to </b>
// case-insensitive search
func changeUITagToBTag(content string) string {
    pattern := `<(u|i)>([^<>]+?)</(u|i)>`
    compiledPattern := regexp.MustCompile(fmt.Sprintf(`(?%v)%v`, "i", pattern))
    content = compiledPattern.ReplaceAllStringFunc(content, func(text string) string {
        allSubStrings := compiledPattern.FindAllStringSubmatch(text, -1)
        if allSubStrings[0][1] == allSubStrings[0][3] {
            return fmt.Sprintf(`<b>%s</b>`, allSubStrings[0][2])
        }
        return text
    })
    return content
}
eliranwong
  • 21
  • 2