Type of string elements is uint8 using index and int32 on value

Question

Here I am checking the type of each element of the string s, once through the index (s[k]) and once through the range value v, and the two give different results. Using the index I get the type uint8, but for the range value I get int32.

package main

import "fmt"

func main() {
    s := "AaBbCcXxYyZz"
    for k, v := range s {
        // s[k] is the byte at index k; v is the value produced by range.
        fmt.Printf("%v\t%T\t%s\n", s[k], s[k], string(s[k]))
        fmt.Printf("%v\t%T\t%s\n", v, v, string(v))
    }
}

Pak Uula · Answer 1 · 2022-11-06T07:37:30.167

1

The loop for k,v := range s {} iterates over Unicode code points. In Go they are called runes and are represented as 32-bit signed integers (the type rune is an alias for int32):

For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.

Golang specification

Indexing with s[k], by contrast, returns the k-th byte of the string, and a byte has the type byte, which is an alias for uint8.
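To see where uint8 and int32 come from, it may help to print the types directly. This is a small sketch; typeNames is a hypothetical helper introduced here, not something from the question.

```go
package main

import "fmt"

// typeNames (hypothetical helper) reports the type of s[k] and the type
// of the range value for the first element of s.
func typeNames(s string) (indexType, rangeType string) {
	for k, v := range s {
		return fmt.Sprintf("%T", s[k]), fmt.Sprintf("%T", v)
	}
	return "", "" // empty string: nothing to inspect
}

func main() {
	var b byte = 'A' // byte is an alias for uint8
	var r rune = 'A' // rune is an alias for int32
	fmt.Printf("%T %T\n", b, r)

	idx, rng := typeNames("AaBbCcXxYyZz")
	fmt.Println(idx, rng)
}
```

Because byte and rune are aliases, %T prints the underlying names uint8 and int32, which is exactly what the question observed.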

The difference is easy to see with multibyte scripts, such as Chinese. Try iterating over the string "給祭断情試紀脱答条証行日稿" (it is a meaningless lorem ipsum phrase in Chinese):

s[0]: 231   uint8   ç
     :32102 int32   給
s[3]: 231   uint8   ç
     :31085 int32   祭
s[6]: 230   uint8   æ
     :26029 int32   断

See the step between the values of k? It is 3 because the UTF-8 encoding of each of these Chinese characters occupies 3 bytes. Full example: https://go.dev/play/p/-44NZMojcgq

edited Nov 06 '22 at 07:37

answered Nov 06 '22 at 06:58

Pak Uula


1

A rune is represented by the type `rune` which is a synonym for `int32` (not a 32-bit unsigned integer). A string contains arbitrary bytes, the k'th of which is returned by `s[k]` - there's no "internal utf-8 representation of the string". See https://go.dev/blog/strings – Paul Hankin Nov 06 '22 at 07:09
@PaulHankin thank you - runes are *signed* integers, fixed. As for UTF-8: [The value of a raw string literal is the string composed of the uninterpreted (implicitly **UTF-8-encoded**) characters between the quotes](https://go.dev/ref/spec#String_literals) (Golang specs) – Pak Uula Nov 06 '22 at 07:24
2

The quote is not talking about strings, it's talking about raw string literals. Yes, a raw string literal (ie: one surrounded by backquotes) is encoded into utf-8 to form a string, but that doesn't mean a string is always utf-8 -- it's not, it's an arbitrary sequence of bytes, and may not contain valid utf-8. – Paul Hankin Nov 06 '22 at 07:38
2

Good reading: https://go.dev/ref/spec#String_types and https://go.dev/blog/strings – Paul Hankin Nov 06 '22 at 07:45
@PaulHankin I modified the answer. `range` iterator for a string interprets it as a sequence of utf-8-encoded unicode codepoints. The string literal that TS iterates is internally encoded as UTF-8 – Pak Uula Nov 06 '22 at 07:46
