
I am trying to write a scanner in Go that handles continuation lines and cleans each line up before returning it, so that it returns logical lines. Given the following split function:

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    i := bytes.IndexByte(data, '\n')
    for i > 0 && data[i-1] == '\\' {
        fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
        i = i + bytes.IndexByte(data[i+1:], '\n')
    }

    var match []byte = nil
    advance := 0
    switch {
    case i >= 0:
        advance, match = i + 1, data[0:i]
    case atEOF: 
        advance, match = len(data), data
    }
    token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
    return advance, token, nil
}

func main() {
    simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`

    scanner := bufio.NewScanner(strings.NewReader(simple))
    scanner.Split(ScanLogicalLines)
    for scanner.Scan() {
        fmt.Printf("line: %q\n", scanner.Text())
    }
}

I expected the code to return something like:

line: "Just a test."
line: ""
line: "See what is returned. when you have empty lines."
line: ""
line: "Followed by a newline."

However, it stops after returning the first line. The second call returns 1, "", nil.

Does anybody have any ideas, or is this a bug?

nemo
Mats Kindahl

1 Answer

I would regard this as a bug: an advance value > 0 is not supposed to trigger a further read call, even when the returned token is nil. The documentation of bufio.SplitFunc only describes (0, nil) as the request for more data:

If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

What happens is this

The input buffer of the bufio.Scanner defaults to 4096 bytes. That means it reads up to that amount at once, if it can, and then executes the split function. In your case the scanner can read your whole input at once, as it is well below 4096 bytes. The next read it attempts therefore results in EOF, which is the main problem here.

Step by step

  1. scanner.Scan reads all your data
  2. You get all the text that is there
  3. You look for a token and find the first line, which consists of a single newline
  4. You return nil as the token, because removing the newline from the match leaves nothing
  5. scanner.Scan assumes: user needs more data
  6. scanner.Scan attempts to read more
  7. EOF happens
  8. scanner.Scan tries to tokenize one last time
  9. You find "Just a test."
  10. scanner.Scan tries to tokenize one last time
  11. You look for a token and find the third line, which consists of a single newline
  12. You return nil as the token, because removing the newline from the match leaves nothing
  13. scanner.Scan sees the nil token and sets the error (EOF)
  14. Execution ends

How to circumvent

Any non-nil token will prevent this. As long as you return non-nil tokens, the scanner will not check for EOF and will keep calling your tokenizer.

The reason your code returns nil tokens is that bytes.Replace returns nil when its input is empty and there is nothing to replace, because append([]byte(nil), nil...) == nil. You can prevent this by returning a slice with capacity but no elements, as that is non-nil: make([]byte, 0, 1) != nil.
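
The difference between a nil slice and an empty one is easy to miss because both print as "" once converted to a string. The small program below is a sketch of that distinction. The helper nonNil is a made-up name, and the result of the bytes.Replace call is printed rather than asserted, since it depends on the library implementation.

```go
package main

import (
	"bytes"
	"fmt"
)

// nonNil returns b unchanged unless it is nil, in which case it returns an
// empty slice that is not nil. The helper name is hypothetical.
func nonNil(b []byte) []byte {
	if b == nil {
		return make([]byte, 0, 1)
	}
	return b
}

func main() {
	var nilSlice []byte // zero value of a slice: nil
	empty := []byte{}   // zero length, but not nil

	fmt.Println(nilSlice == nil, len(nilSlice)) // true 0
	fmt.Println(empty == nil, len(empty))       // false 0

	// Appending zero elements to a nil slice leaves it nil.
	copied := append([]byte(nil), empty...)
	fmt.Println(copied == nil) // true

	// The call from the question, applied to an empty match.
	replaced := bytes.Replace(empty, []byte("\\\n"), nil, -1)
	fmt.Println("Replace returned nil:", replaced == nil)

	// Guarding the token keeps it non-nil either way.
	fmt.Println(nonNil(replaced) == nil) // false

	// Both convert to the same empty string, which hides the difference.
	fmt.Printf("%q %q\n", string(nilSlice), string(empty))
}
```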

nemo
  • Sorry, but is an empty string nil? When I add trace printouts, it prints `advance=1, token="", err=`. Doing the same thing with the standard ScanLines prints the same info, but scanning continues. – Mats Kindahl Nov 13 '13 at 06:35
  • Testing a zero-length slice against a nil value seems to show that they are different. http://play.golang.org/p/lwWIomKcNF – Mats Kindahl Nov 13 '13 at 07:00
  • No, the empty string is not nil. `nil` is the zero value for [some](http://golang.org/ref/spec#The_zero_value) types, but not for string. However, `string([]byte(nil))` is equal to the empty string. Are you (implicitly) converting `token` to a string? See [this version of your code with debug output](http://play.golang.org/p/HGEoxbqZ9j). – nemo Nov 13 '13 at 13:55
  • Ah, that explains it. The following code works as expected: http://play.golang.org/p/JoPy9sozb9. Note that the documentation does not state this, so that is at least a bug in the documentation. I find this behaviour very unintuitive and not very useful. It would be better to return the original slice, or a copy of it, when there are no replacements. – Mats Kindahl Nov 13 '13 at 18:20