string values in Go store the UTF-8 encoded bytes of the text, not its characters or runes.
Indexing a string indexes its bytes: str[i] is of type byte (or uint8; byte is an alias for it). A string is in effect a read-only slice of bytes (with some syntactic sugar), so indexing a string does not require converting it to a slice.
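To make the byte indexing concrete, here is a minimal sketch. The helper byteAt is a hypothetical name used only for illustration; the sample text is the same one used in the example further down.

```go
package main

import "fmt"

// byteAt is a hypothetical helper: it returns the byte stored at index i of s.
// No conversion to []byte is needed to index a string.
func byteAt(s string, i int) byte {
	return s[i]
}

func main() {
	s := "Hi 世界"
	// len reports the number of bytes, not the number of characters.
	fmt.Println(len(s)) // 9
	// Index 3 holds the first byte of the three-byte UTF-8 encoding of '世'.
	fmt.Printf("%T %#x\n", s[3], byteAt(s, 3)) // uint8 0xe4
}
```

Note that s[3] on its own is not a character here, only the first of the three bytes that encode one.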
When you use for ... range on a string, that iterates over the runes of the string, not its bytes!
So if you want to iterate over the runes (characters), you must use a for ... range on the string itself, without a conversion to []byte, as the first form will not work with string values containing multi-byte UTF-8 characters.
The spec allows you to for ... range on a string value: the 1st iteration value will be the byte index of the current character, and the 2nd value will be the current character itself, of type rune (which is an alias for int32):
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
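The last sentence of the quote, about invalid UTF-8, can be checked with a small sketch. The input "a\xffb" is made up for this illustration (the byte 0xff never appears in valid UTF-8), and decodeAll is a hypothetical helper.

```go
package main

import "fmt"

// decodeAll is a hypothetical helper: it returns the runes that
// for ... range yields for s, in order.
func decodeAll(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// The invalid byte sits between two ASCII letters.
	for i, r := range "a\xffb" {
		fmt.Printf("%d: %U\n", i, r)
	}
	// The invalid byte yields U+FFFD, and the loop advances by one byte,
	// so 'b' is still reported at index 2.
	fmt.Println(len(decodeAll("a\xffb"))) // 3
}
```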
Simple example:
s := "Hi 世界"
for i, c := range s {
	fmt.Printf("Char pos: %d, Char: %c\n", i, c)
}
Output (try it on the Go Playground):
Char pos: 0, Char: H
Char pos: 1, Char: i
Char pos: 2, Char:
Char pos: 3, Char: 世
Char pos: 6, Char: 界
A must-read blog post on this topic:
The Go Blog: Strings, bytes, runes and characters in Go
Note: If you must iterate over the bytes of a string (and not its characters), using a for ... range with a converted string like your second example does not make a copy; the conversion is optimized away. For details, see golang: []byte(string) vs []byte(*string).
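For reference, a byte-wise loop of that kind might look like the sketch below. It assumes the second example ranges over []byte(s); countBytes is a hypothetical helper used only to make the behaviour checkable.

```go
package main

import "fmt"

// countBytes is a hypothetical helper: it ranges over the string converted
// to a byte slice, so each iteration covers one byte rather than one rune.
func countBytes(s string) int {
	n := 0
	for range []byte(s) {
		n++
	}
	return n
}

func main() {
	s := "Hi 世界"
	// Here i is a byte index and b has type byte.
	for i, b := range []byte(s) {
		fmt.Printf("%d:%x ", i, b)
	}
	fmt.Println()
	fmt.Println(countBytes(s)) // 9
}
```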