How to efficiently replace string occurrences between two string delimiters using Go bytes?

For example, the content of my flat file (3 MB) is similar to:

Lorem START ipsum END dolor sit amet, START adipiscing END elit.
Ipsum dolor START sit END amet, START elit. END
.....

I would like to replace all occurrences between the START and END delimiters. Since my file is 3 MB, I think it's a bad idea to load the whole content into memory.

Thanks.

joseluisq

1 Answer

You can use bufio.Scanner with bufio.ScanWords, tokenize on whitespace boundaries, and compare non-whitespace sequences to your delimiter:

scanner := bufio.NewScanner(reader)

scanner.Split(bufio.ScanWords) // you can implement your own split function
                               // but ScanWords will suffice for your example

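for scanner.Scan() {
    // scanner.Bytes() efficiently exposes the file contents
    // as slices of a larger buffer
    if bytes.HasPrefix(scanner.Bytes(), []byte("START")) {
        // ... keep scanning until the end delimiter
    }

    // copying unmodified inputs is quite simple; note that ScanWords
    // drops the whitespace between tokens, so write a separator too
    // if the output has to keep the words apart
    _, err := writer.Write(scanner.Bytes())
    if err != nil {
        return err
    }
}

// Scan() returns false on errors as well as at EOF
if err := scanner.Err(); err != nil {
    return err
}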

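This ensures that the amount of data held in memory at any time remains bounded: the scanner's buffer is capped at bufio.MaxScanTokenSize (64 KiB) unless you raise the limit with scanner.Buffer. A single token longer than the limit makes Scan() stop with bufio.ErrTooLong.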

Note that if you want to use multiple goroutines, you'll need to copy the data first, since scanner.Bytes() returns a slice that is only valid until the next call to .Scan(). If you choose to do that, though, I wouldn't bother with a scanner.
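A minimal sketch of what "copy the data first" means, assuming a single worker goroutine fed over a channel (the name collectWords is made up for illustration). Without the copy, the worker could still be reading a token while the scanner overwrites its buffer:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"sync"
)

// collectWords (hypothetical helper) scans words from input and hands each
// one to a worker goroutine, which stores them as strings.
func collectWords(input string) ([]string, error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(bufio.ScanWords)

	tokens := make(chan []byte)
	var got []string

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for tok := range tokens {
			got = append(got, string(tok))
		}
	}()

	for scanner.Scan() {
		// scanner.Bytes() points into the scanner's internal buffer,
		// so take a private copy before the slice leaves this goroutine.
		tok := append([]byte(nil), scanner.Bytes()...)
		tokens <- tok
	}
	close(tokens)
	wg.Wait() // got is only read after the worker has finished

	return got, scanner.Err()
}

func main() {
	words, err := collectWords("Lorem START ipsum END")
	if err != nil {
		fmt.Println("scan error:", err)
		return
	}
	fmt.Println(words)
}
```

Every token is allocated separately here, which is why the scanner stops paying off once you go down this road.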

For what it's worth, loading a 3 MB file into memory is not such a bad idea on a general purpose computer nowadays; I would only think twice if it were an order of magnitude bigger. It would almost certainly be faster to read the whole file and use bytes.Split with your delimiters.
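Here is a rough sketch of the in-memory route. Note that it walks the data with bytes.Index rather than bytes.Split, which is a substitution on my part because it keeps the bookkeeping simpler; the function name replaceBetween and the replacement text are made up for illustration. Unlike the word scanner, it preserves all whitespace outside the delimiters:

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceBetween (hypothetical helper) returns a copy of data in which the
// text between each start/end pair is replaced by repl. The delimiters are
// kept. An unterminated start leaves the rest of the data untouched.
func replaceBetween(data, start, end, repl []byte) []byte {
	if len(start) == 0 || len(end) == 0 {
		return data // empty delimiters would match everywhere
	}

	var out bytes.Buffer
	for {
		i := bytes.Index(data, start)
		if i < 0 {
			break
		}
		bodyStart := i + len(start)
		j := bytes.Index(data[bodyStart:], end)
		if j < 0 {
			break
		}
		out.Write(data[:bodyStart]) // text before start, plus start itself
		out.Write(repl)
		out.Write(end)
		data = data[bodyStart+j+len(end):] // continue after this end
	}
	out.Write(data) // whatever is left after the last pair
	return out.Bytes()
}

func main() {
	in := []byte("Lorem START ipsum END dolor sit amet, START adipiscing END elit.")
	out := replaceBetween(in, []byte("START"), []byte("END"), []byte(" X "))
	fmt.Printf("%s\n", out)
	// prints: Lorem START X END dolor sit amet, START X END elit.
}
```

For a real file you would read the contents into a []byte first and write the result back out afterwards.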

nothingmuch
  • Scanner & Split both good options, I'd also add regexp as a potential solution, though regexp performance in Go is unfortunately poor due to UTF-8 handling. – Adrian May 19 '17 at 15:39
  • Well, regexps have the additional advantage of taking care of the substitution logic, I didn't reach for that here because it seemed to me like the data in between the delimiters would need some processing, but reading the question more carefully it doesn't actually say that =) – nothingmuch May 19 '17 at 16:02
  • Yep, I'm a beginner in Go. In fact I thought of regex first (it's "easier" to implement), but then I worried about its performance with large files. Although 3 MB (or 10 MB) is apparently not a problem for Go. :D – joseluisq May 19 '17 at 16:15
  • Go is a really nice mix of the features from dynamic languages that really make a difference, with most of the precision and directness of C. It's a great language for this kind of task. Easy things are easy, and it's usually very easy to optimize aggressively when the need actually arises. – nothingmuch May 19 '17 at 16:27
  • Yeah, you're right! I come from a dynamic scripting language, JavaScript (Node.js), but I have to say that my first experience with Go has been very comfortable. – joseluisq May 19 '17 at 17:04
  • @nothingmuch I found this issue with `Scanner`: http://stackoverflow.com/a/41741702/2510591 What do you think about it? – joseluisq May 20 '17 at 09:55
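
For completeness, the regexp route mentioned in the comments could look roughly like the sketch below. It assumes the whole file is already in memory, and the function name replaceBetweenRegexp and the replacement text are made up for illustration. regexp.QuoteMeta guards against delimiters that contain regexp metacharacters, the (?s) flag lets . match newlines, and .*? is non-greedy so each START pairs with the nearest END:

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceBetweenRegexp (hypothetical helper) replaces the text between each
// start/end pair in data with repl, keeping the delimiters.
func replaceBetweenRegexp(data []byte, start, end, repl string) []byte {
	pattern := `(?s)` + regexp.QuoteMeta(start) + `.*?` + regexp.QuoteMeta(end)
	re := regexp.MustCompile(pattern)

	// ReplaceAllLiteral avoids interpreting "$" in the replacement text.
	return re.ReplaceAllLiteral(data, []byte(start+repl+end))
}

func main() {
	in := []byte("Lorem START ipsum END dolor sit amet, START adipiscing END elit.")
	fmt.Printf("%s\n", replaceBetweenRegexp(in, "START", "END", " X "))
	// prints: Lorem START X END dolor sit amet, START X END elit.
}
```

Whether this is fast enough for your files is something to measure rather than assume.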