string values in Go are, in effect, read-only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the runes of the string. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using different numbers of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the Unicode code point itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax reports the maximum number of bytes a valid Unicode code point may "occupy" in UTF-8 encoding (which is 4).
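To make the variable-length point concrete, here is a small sketch. The helper encodedLengths and the sample string are my own, chosen only for illustration, and the helper assumes its input is valid UTF-8 (an invalid byte would be reported as the 3-byte replacement character).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLengths returns, for each rune of s, the number of bytes its
// UTF-8 encoding occupies. Hypothetical helper; assumes s is valid UTF-8.
func encodedLengths(s string) []int {
	var lengths []int
	for _, r := range s {
		lengths = append(lengths, utf8.RuneLen(r))
	}
	return lengths
}

func main() {
	s := "aé日😀"
	fmt.Println(len(s))                    // 10 (bytes, not runes)
	fmt.Println(utf8.RuneCountInString(s)) // 4
	fmt.Println(encodedLengths(s))         // [1 2 3 4]
	fmt.Println(utf8.UTFMax)               // 4
}
```

Note that len(s) counts bytes, which is why it differs from the rune count for any string containing non-ASCII characters.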
One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may hold any byte sequence, even one that is not valid UTF-8. For example, the string value "\xff" represents an invalid UTF-8 byte sequence; for details, see How do I represent an Optional String in Go?
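If you want to check this yourself, utf8.ValidString reports whether a string holds only valid UTF-8. A minimal sketch (the wrapper isValid is a hypothetical name, added only so the check is easy to reuse):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isValid reports whether s consists entirely of valid UTF-8 sequences.
// Hypothetical thin wrapper around utf8.ValidString.
func isValid(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	fmt.Println(isValid("日本語")) // true
	fmt.Println(isValid("\xff"))  // false
	fmt.Println(len("\xff"))      // 1: still a perfectly legal 1-byte string
}
```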
The for range construct, when applied to a string value, iterates over the runes of the string:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:
for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}
For each iteration, pos will be the byte index of the rune / character, and char will be the rune itself. As you can see in the quote above, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.
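Here is the same loop wrapped into a complete program, as a sketch. The describe helper is my own addition so the lines can be inspected; the format string is the one from your example.

```go
package main

import "fmt"

// describe returns one formatted line per iteration of for range over s.
// Hypothetical helper, added only to make the output easy to check.
func describe(s string) []string {
	var lines []string
	for pos, char := range s {
		lines = append(lines, fmt.Sprintf("Character %#U, at position: %d", char, pos))
	}
	return lines
}

func main() {
	for _, line := range describe("日本\x80語") {
		fmt.Println(line)
	}
	// Output:
	// Character U+65E5 '日', at position: 0
	// Character U+672C '本', at position: 3
	// Character U+FFFD '�', at position: 6
	// Character U+8A9E '語', at position: 7
}
```

The first two characters take 3 bytes each, so the position jumps 0, 3, 6. The invalid byte \x80 at index 6 is reported as U+FFFD and the position moves forward by just 1, to 7.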
To sum it up: the position is always the byte index of the rune of the current iteration (more specifically, the byte index of the first byte of the UTF-8 encoded sequence of that rune), but if an invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration.
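One way to convince yourself of this is to compare the indices from the one-value form of for range with a manual walk using utf8.DecodeRuneInString, which returns the size of each decoded sequence (1 for an invalid one). This is only a sketch; both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rangePositions collects the indices produced by the one-value form
// of for range over a string.
func rangePositions(s string) []int {
	var positions []int
	for pos := range s {
		positions = append(positions, pos)
	}
	return positions
}

// decodePositions walks s by hand, advancing by the size that
// utf8.DecodeRuneInString reports for each sequence.
func decodePositions(s string) []int {
	var positions []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		positions = append(positions, i)
		i += size
	}
	return positions
}

func main() {
	s := "日本\x80語"
	fmt.Println(rangePositions(s))  // [0 3 6 7]
	fmt.Println(decodePositions(s)) // [0 3 6 7]
}
```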
A must-read blog post if you want to know more about the topic:
The Go Blog: Strings, bytes, runes and characters in Go