In Go, to access elements of a string, we can write:

str := "text"
for i, c := range str {
  // str[i] is of type byte
  // c is of type rune
}

When accessing str[i], does Go perform a conversion from rune to byte? I would guess the answer is yes, but I am not sure. If so, which of the following methods is better performance-wise? Is one preferred over the other (in terms of best practice, for example)?

str := "large text"
for i := range str {
  // use str[i]
}

or

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}
icza
Eissa N.

2 Answers

string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.

Indexing a string indexes its bytes: str[i] is of type byte (or uint8; byte is an alias for it). A string is in effect a read-only slice of bytes (with some syntactic sugar), so indexing a string does not require converting it to a slice, and no rune-to-byte conversion takes place.
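
To make the byte-versus-rune distinction concrete, here is a minimal sketch. The helper name byteAndRuneCounts is made up for illustration; len and utf8.RuneCountInString are the standard ways to count bytes and runes respectively.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCounts (hypothetical helper) returns the number of bytes
// and the number of runes in s.
func byteAndRuneCounts(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "Hi 世界"
	nBytes, nRunes := byteAndRuneCounts(s)
	fmt.Println(nBytes, nRunes) // 9 5

	// Indexing yields the byte stored at that offset; nothing is decoded.
	var b byte = s[3]
	fmt.Printf("%T 0x%X\n", b, b) // uint8 0xE4
}
```

Here s[3] is the first of the three bytes that encode 世, not the character itself.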

When you use for ... range on a string, it iterates over the runes of the string, not its bytes!

So if you want to iterate over the runes (characters), you must use for ... range on the string itself, without a conversion to []byte. Working with str[i] or with the elements of []byte(str) gives you single bytes, which does not work for string values containing multi-byte UTF-8 characters. The spec allows you to for ... range over a string value: the 1st iteration value is the byte index of the current character, and the 2nd value is the current character, of type rune (which is an alias for int32):

For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
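
The last sentence of that quote, about invalid UTF-8, can be illustrated with a small sketch. The helper decodeAll and the input string are invented for this example; utf8.RuneError is the standard library's name for 0xFFFD.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical helper) ranges over s and records the byte
// index and the rune produced at each iteration.
func decodeAll(s string) (idx []int, runes []rune) {
	for i, c := range s {
		idx = append(idx, i)
		runes = append(runes, c)
	}
	return idx, runes
}

func main() {
	// "a", then two stray bytes that are not valid UTF-8, then "b".
	s := "a\xff\xfeb"
	idx, runes := decodeAll(s)
	fmt.Println(idx)                         // [0 1 2 3]
	fmt.Println(runes[1] == utf8.RuneError) // true
	fmt.Println(runes[2] == utf8.RuneError) // true
}
```

Each invalid byte produces its own replacement character, and the index advances by exactly one byte.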

Simple example:

s := "Hi 世界"
for i, c := range s {
    fmt.Printf("Char pos: %d, Char: %c\n", i, c)
}

Output (try it on the Go Playground):

Char pos: 0, Char: H
Char pos: 1, Char: i
Char pos: 2, Char:  
Char pos: 3, Char: 世
Char pos: 6, Char: 界

A must-read blog post on this topic:

The Go Blog: Strings, bytes, runes and characters in Go


Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range with a converted string, like your second example, does not make a copy; the compiler optimizes the copy away. For details, see golang: []byte(string) vs []byte(*string).
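
As a rough sketch of byte-wise iteration, the two forms below visit the same bytes. The function names are invented for illustration, and counting bytes >= 0x80 is just an arbitrary task to give the loops something to do. Whether the copy is actually elided depends on the compiler version, so measure if it matters.

```go
package main

import "fmt"

// countNonASCIIByIndex (hypothetical) counts bytes >= 0x80 by indexing
// the string directly; no conversion is involved.
func countNonASCIIByIndex(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			n++
		}
	}
	return n
}

// countNonASCIIByRange (hypothetical) does the same by ranging over
// []byte(s), so the loop variable is a byte rather than a rune.
func countNonASCIIByRange(s string) int {
	n := 0
	for _, b := range []byte(s) {
		if b >= 0x80 {
			n++
		}
	}
	return n
}

func main() {
	s := "Hi 世界"
	fmt.Println(countNonASCIIByIndex(s), countNonASCIIByRange(s)) // 6 6
}
```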

icza

Which one of the following methods are better performance-wise?

Definitely not this.

str := "large text"
str2 := []byte(str)
for _, s := range str2 {
  // use s
}

Strings are immutable. []byte is mutable. That means []byte(str) in general makes a copy, so the above may copy the entire string. I've found being unaware of when strings are copied to be a major source of performance problems for large strings.

If str2 is never altered, the compiler may optimize away the copy. For this reason, it's better to write the above like so, to ensure the byte slice is never altered.

str := "large text"
for _, s := range []byte(str) {
  // use s
}

That way there's no str2 to possibly be modified later and ruin the optimization.

But this is a bad idea because it will corrupt any multi-byte characters. See below.


As for the byte/rune conversion, performance is not a consideration as they are not equivalent. c will be a rune, and str[i] will be a byte. If your string contains multi-byte characters, you have to use runes.

For example...

package main

import (
    "fmt"
)

func main() {
    str := "snow ☃ man"
    for i, c := range str {
        fmt.Printf("c:%c str[i]:%c\n", c, str[i])
    }
}

$ go run ~/tmp/test.go
c:s str[i]:s
c:n str[i]:n
c:o str[i]:o
c:w str[i]:w
c:  str[i]: 
c:☃ str[i]:â
c:  str[i]: 
c:m str[i]:m
c:a str[i]:a
c:n str[i]:n

Note that using str[i] corrupts the multi-byte Unicode snowman: it yields only the first byte of the multi-byte character.
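
If you need the whole character starting at the index that range gives you, one option is to decode it from that offset. This is only a sketch: charAt is a made-up helper, and it assumes i is the start of a character, which is always true of a range index.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charAt (hypothetical helper) returns the full UTF-8 encoded character
// that starts at byte index i of s. It assumes i is a character boundary.
func charAt(s string, i int) string {
	_, size := utf8.DecodeRuneInString(s[i:])
	return s[i : i+size]
}

func main() {
	str := "snow ☃ man"
	for i, c := range str {
		// utf8.RuneLen reports how many bytes c occupies when encoded.
		fmt.Printf("c:%c charAt:%s bytes:%d\n", c, charAt(str, i), utf8.RuneLen(c))
	}
}
```

For the snowman this prints the intact character with a length of 3 bytes, instead of the lone first byte that str[i] yields.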

There's no performance difference anyway, as range str must already do the work of going character by character, not byte by byte.

Schwern
  • The first part about []byte/string is not true per se. The compiler performs escape analysis and some extra optimizations around []byte/string to avoid some allocations. If you compile your first snippet with '-gcflags "-m -m"' you will see that this case won't allocate since str2 is only used in the loop and does not escape. – nussjustin Jun 11 '17 at 19:51
  • @nussjustin Yeah. Writing it as `for ... range []byte(str)` makes the optimization more obvious. – Schwern Jun 11 '17 at 19:53
  • In general converting `string` to `[]byte` does make a copy, but if used in a `for ... range`, it won't. It is optimized away by the compiler. For details see this [answer](https://stackoverflow.com/questions/43470284/golang-bytestring-vs-bytestring/43470344#43470344). – icza Jun 11 '17 at 20:20