What determines the position of a character when looping through UTF-8 strings?

Question

I am reading the section on for statements in the Effective Go documentation and came across this example:

for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

The output is:

Character U+65E5 '日', at position: 0
Character U+672C '本', at position: 3
Character U+FFFD '�', at position: 6
Character U+8A9E '語', at position: 7

What I don't understand is why the positions are 0, 3, 6, and 7. This tells me the first and second characters are each 3 bytes long and the 'replacement rune' (U+FFFD) is 1 byte long, which I accept and understand. However, I thought a rune was of type int32 and would therefore take 4 bytes, not 3.

Why are the positions in a range different to the total amount of memory each value should be consuming?

score 6 · Accepted Answer · edited May 23 '17 at 11:52

string values in Go are in effect read-only byte slices ([]byte), where the bytes are the UTF-8 encoded bytes of the (runes of the) string. UTF-8 is a variable-length encoding: different Unicode code points may be encoded using a different number of bytes. For example, values in the range 0..127 are encoded as a single byte (whose value is the Unicode code point itself), but values greater than 127 use more than 1 byte. The unicode/utf8 package contains UTF-8 related utility functions and constants; for example, utf8.UTFMax is the maximum number of bytes a valid Unicode code point may "occupy" in UTF-8 encoding (which is 4).
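To make the variable-length point concrete, here is a small sketch that compares the UTF-8 size of a few code points with the fixed 4 bytes a rune (int32) takes in memory. The helper name encodedLen is made up for this illustration; it only wraps utf8.RuneLen from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical helper: it reports how many bytes r
// occupies once encoded as UTF-8 (between 1 and utf8.UTFMax).
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// A rune value always takes 4 bytes in memory (it is an int32),
	// but its UTF-8 encoding inside a string takes 1 to 4 bytes.
	for _, r := range []rune{'A', '\u00e9', '日', '😀'} {
		fmt.Printf("%#U: %d byte(s) in UTF-8\n", r, encodedLen(r))
	}

	s := "日本語"
	fmt.Println("bytes:", len(s))                    // len counts bytes
	fmt.Println("runes:", utf8.RuneCountInString(s)) // this counts code points
	fmt.Println("UTFMax:", utf8.UTFMax)
}
```

The string stores the encoded bytes, so its length and its indices are measured in bytes, not in runes.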

One thing to note here: not all possible byte sequences are valid UTF-8 sequences. A string may hold any byte sequence, even one that is invalid UTF-8. For example, the string value "\xff" represents an invalid UTF-8 byte sequence. For details, see How do I represent an Optional String in Go?
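A quick way to check this is utf8.ValidString. The sketch below is only an illustration; inspect is a hypothetical helper name, not something from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// inspect is a hypothetical helper: it returns the length of s in bytes
// and whether those bytes form valid UTF-8.
func inspect(s string) (int, bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	for _, s := range []string{"日本語", "\xff", "日本\x80語"} {
		n, ok := inspect(s)
		fmt.Printf("%q: %d byte(s), valid UTF-8: %t\n", s, n, ok)
	}
}
```

The string from the question is 10 bytes long (3 + 3 + 1 + 3) and is not valid UTF-8, because the lone \x80 byte cannot start an encoded code point.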

The for range construct, when applied to a string value, iterates over the runes of the string:

For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

The for range construct may produce 1 or 2 iteration values. When using 2, like in your example:

for pos, char := range "日本\x80語" {
    fmt.Printf("Character %#U, at position: %d\n", char, pos)
}

For each iteration, pos will be the byte index of the rune / character, and char will be the rune itself. As the quote above says, when an invalid UTF-8 sequence is encountered, char will be 0xFFFD (the Unicode replacement character), and the for range construct (the iteration) will advance a single byte only.

To sum it up: the position is always the byte index of the rune of the current iteration (more specifically, the byte index of the first byte of the rune's UTF-8 encoded sequence), but if an invalid UTF-8 sequence is encountered, the position (index) will only be incremented by 1 in the next iteration. In your example, 日 and 本 each take 3 bytes (positions 0 and 3), the invalid byte \x80 sits at position 6 and advances the index by 1, so 語 starts at position 7.
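The same positions can be reproduced by decoding the string by hand. This is a minimal sketch of the behaviour described in the quoted spec text, not the compiler's actual implementation; runeStarts is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStarts is a hypothetical helper that walks s the way for range
// does and returns the byte index at which each decoded rune starts.
func runeStarts(s string) []int {
	var starts []int
	for i := 0; i < len(s); {
		// For an invalid byte, DecodeRuneInString returns
		// utf8.RuneError (U+FFFD) with a size of 1.
		_, size := utf8.DecodeRuneInString(s[i:])
		starts = append(starts, i)
		i += size // advance by the encoded width, not by 4
	}
	return starts
}

func main() {
	s := "日本\x80語"
	fmt.Println(runeStarts(s)) // [0 3 6 7]

	// Converting to []rune first gives rune indices instead of byte offsets.
	for i, r := range []rune(s) {
		fmt.Printf("%d:%#U ", i, r)
	}
	fmt.Println()
}
```

The index advances by the encoded width of each rune, which is why it never depends on the 4-byte size of the rune type.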

A must-read blog post if you want to know more about the topic:

The Go Blog: Strings, bytes, runes and characters in Go

That blog post was very useful and I recommend anyone with a similar query to read it. I think my confusion was thinking `range` performed some implicit conversion to a slice of runes, when actually it does not. Thank you. — HenryTK, Jan 21 '17 at 13:16

cshu · Answer 2 · 2017-01-21T12:41:12.623

A rune is a code point, and a code point is just an integer. You could even use an int64 to store it if you wanted to. (But Unicode has only 1,114,112 code points, so int32 is the right choice. No wonder rune is an alias for int32 in Golang.)

Different encoding schemes encode code points in different ways. E.g. a CJK character is usually encoded as 3 bytes in UTF-8 and as 2 bytes in UTF-16.
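As a rough illustration of that difference, the sketch below measures one code point in both encodings using the standard unicode/utf8 and unicode/utf16 packages. encodedSizes is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes is a hypothetical helper: it returns the number of bytes
// r needs in UTF-8 and in UTF-16 (2 bytes per 16-bit code unit).
func encodedSizes(r rune) (utf8Bytes, utf16Bytes int) {
	return utf8.RuneLen(r), 2 * len(utf16.Encode([]rune{r}))
}

func main() {
	for _, r := range []rune{'A', '語', '😀'} {
		u8, u16 := encodedSizes(r)
		fmt.Printf("%#U: UTF-8 %d byte(s), UTF-16 %d byte(s)\n", r, u8, u16)
	}
}
```

The code point itself is the same integer in every case; only the number of bytes used to store it changes with the encoding.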

String literals in Golang are UTF-8, because Go source code is UTF-8 (unless the literal contains byte escapes such as \x80).
