A tool written in Go needs to find every (format) string in a C or C++ source file, including strings that contain escaped characters or newlines. Examples:

..."foo"...
...`foo:"foo"`...
..."foo
foo"...
..."foo\r\nfoo"...
...`foo"foo-

lish`

The tool may also match strings inside comments or deactivated code, so there is no need to exclude those parts.

While searching for a solution, I got this pattern to work on https://regex101.com/r/FDhldb/1:

/(["'`])(?:(?=(\\?))\2.)*?\1/gms

Unfortunately this does not compile in Go:

const (
    patFmtString = `(?Us)(["'])(?:(?=(\\?))\2.)*?\1`
)
var (
    matchFmtString = regexp.MustCompile(patFmtString)
)

Even the simplified pattern (?Us)(["'])(?:(\\?).)*?\1 fails with "error parsing regexp: invalid escape sequence: \1".

How do I implement this correctly in Go, ideally in a way that also runs fast?

Wiktor Stribiżew
  • 2
    Go doesn't support backreferences in regex. Consider using an alternative regex implementation, or a different approach (manual parsing may well be adequate in this case). – markalex Jun 30 '23 at 08:43
  • 3
    Please note that C and C++ are two very different languages. That even applies to things like string literals, where the two languages differ (C++ has "raw" string literals, which C doesn't). – Some programmer dude Jun 30 '23 at 08:43
  • 1
    You may find your answer in this question: https://stackoverflow.com/questions/23968992/how-to-match-a-regex-with-backreference-in-go – Schnitter Jun 30 '23 at 09:10
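
Following up on the comments: Go's regexp package uses RE2 syntax, which has neither backreferences nor lookaheads, so \1 and (?=...) cannot be used. If a regex is still preferred, one possible workaround is to write out one alternative per delimiter and to consume escapes with `\\.`. The following is only a sketch under that assumption; literalRe and findLiterals are names made up for illustration, and the pattern has only been reasoned through for inputs like the examples in the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalRe (hypothetical name) spells out one alternative per delimiter,
// which stands in for the backreference \1. Inside a literal, `\\.` consumes
// a backslash together with the character it escapes; (?s) lets `.` match a
// newline. The backtick alternative lives in an interpreted string because a
// Go raw string cannot contain a backtick.
var literalRe = regexp.MustCompile(`(?s)"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|` +
	"`(?:\\\\.|[^`\\\\])*`")

// findLiterals returns every quoted span in src, delimiters included.
func findLiterals(src string) []string {
	return literalRe.FindAllString(src, -1)
}

func main() {
	src := "x = \"foo\"; y = `foo:\"foo\"`; z = \"foo\\r\\nfoo\";"
	for _, lit := range findLiterals(src) {
		fmt.Printf("%q\n", lit)
	}
}
```

If byte offsets are needed instead of the text, FindAllStringIndex can be called on the same compiled pattern.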

1 Answer

You can use a bufio.Scanner with a reasonably simple custom split function to accomplish this instead of using PCRE:

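import "bufio"

// stringLiterals is a bufio.SplitFunc that yields one quoted span per token,
// delimiters included.
var stringLiterals bufio.SplitFunc = func(data []byte, atEOF bool) (advance int, token []byte, err error) {
    scanning := false // true while inside a literal
    var delim byte    // quote character that opened the current literal
    var start int     // index of the opening quote
    i := 0
    for ; i < len(data); i++ {
        switch b := data[i]; b {
        case '\\': // skip escape sequences
            i++
        case '"', '\'', '`':
            if scanning && delim == b {
                // closing quote: emit the literal and advance past it
                return i + 1, data[start : i+1], nil
            } else if !scanning {
                scanning = true
                start = i
                delim = b
            }
        }
    }
    if atEOF {
        // no further complete literal: discard the rest
        return len(data), nil, nil
    }
    if !scanning {
        // Nothing pending: discard what was scanned so the buffer does not
        // keep growing, but keep a trailing backslash whose escaped byte
        // has not arrived yet.
        if i > len(data) {
            return len(data) - 1, nil, nil
        }
        return len(data), nil, nil
    }
    // need more data: drop everything before the unfinished literal
    return start, nil, nil
}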

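and use it like this:

// besides "bufio", this needs "fmt", "log" and "os" in the import list
func main() {
    input := os.Stdin // any io.Reader works, e.g. an opened source file
    scanner := bufio.NewScanner(input)
    scanner.Split(stringLiterals)
    for scanner.Scan() {
        stringLit := scanner.Text()
        fmt.Println(stringLit) // do something with `stringLit`
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err) // e.g. bufio.ErrTooLong for a literal over 64 KiB
    }
}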

For your examples, this returns exactly the matches that your regex does, though I'm not sure that it actually corresponds to the way C++ string literals are defined in the grammar.

You can try it out on the playground.
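
On the caveat about the C++ grammar: as pointed out in the comments on the question, C++ also has raw string literals such as R"x(...)x", which have no escape sequences and end at a custom delimiter, so plain quote matching will split them in the wrong places. Below is a rough sketch of how such a literal could be recognised separately. rawStringEnd is a hypothetical helper, it assumes 0 <= i <= len(src), and it checks the delimiter only approximately (at most 16 characters, no whitespace, parentheses or backslashes).

```go
package main

import (
	"fmt"
	"strings"
)

// rawStringEnd (hypothetical helper) checks whether a raw string literal of
// the form R"delim( ... )delim" starts at src[i]. It returns the index just
// past the closing quote, or -1 if no complete raw literal starts there.
func rawStringEnd(src string, i int) int {
	if !strings.HasPrefix(src[i:], `R"`) {
		return -1
	}
	rest := src[i+2:]
	open := strings.IndexByte(rest, '(')
	// The delimiter may be empty; reject ones that are too long or that
	// contain characters a delimiter cannot contain.
	if open < 0 || open > 16 || strings.ContainsAny(rest[:open], " )\\\t\n\v\f") {
		return -1
	}
	closer := ")" + rest[:open] + `"`
	end := strings.Index(rest[open+1:], closer)
	if end < 0 {
		return -1 // unterminated
	}
	return i + 2 + open + 1 + end + len(closer)
}

func main() {
	src := `auto s = R"x(a "quoted" \ word)x";`
	i := strings.Index(src, `R"`)
	if end := rawStringEnd(src, i); end >= 0 {
		fmt.Println(src[i:end])
	}
}
```

A scanner could call such a check whenever it sees an R immediately followed by a quote, and fall back to ordinary quote matching otherwise.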

isaactfa
  • `func matchFormatString(input string) (loc []int) { scanner := bufio.NewScanner(strings.NewReader(input)) scanner.Split(stringLiterals) if scanner.Scan() { loc = append(loc, strings.Index(input, scanner.Text())) loc = append(loc, loc[0]+len(scanner.Text())) } return }` – Thomas Höhenleitner Jul 04 '23 at 11:14
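
A note on the helper in the comment above: it recovers the position with strings.Index, which returns the first occurrence of the text. That can differ from the scanned position when the same text appears earlier after an escaped quote (for example in `\"ab"ab"`). One way around this, sketched here under the assumption that the whole input is already in memory as a string, is to track offsets directly. literalSpans is a made-up name, and the rules are meant to mirror the scanner from the answer: a backslash skips the next byte, and a span ends at the next unescaped occurrence of its opening delimiter.

```go
package main

import "fmt"

// literalSpans (hypothetical helper) returns the [start, end) byte offsets
// of every quoted span in src.
func literalSpans(src string) [][2]int {
	var spans [][2]int
	start := -1 // -1 means "not inside a literal"
	for i := 0; i < len(src); i++ {
		switch c := src[i]; c {
		case '\\':
			i++ // skip the escaped byte
		case '"', '\'', '`':
			if start < 0 {
				start = i // opening delimiter
			} else if src[start] == c {
				spans = append(spans, [2]int{start, i + 1})
				start = -1
			}
		}
	}
	return spans
}

func main() {
	src := `printf("a: %d\n", 'x');`
	for _, s := range literalSpans(src) {
		fmt.Println(s, src[s[0]:s[1]])
	}
}
```

Taking only the first element of the result gives the same kind of loc pair that the comment's function builds.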