I get that a `rune` is an alias for `int32` because it's supposed to hold all valid Unicode code points. There are apparently 1,114,112 possible Unicode code points, so it makes sense that they would have to be stored in four bytes, i.e. an `int32`-sized value, which can store an integer up to 2,147,483,647.
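To check that part of my understanding, here's a tiny Go snippet I put together (assuming I'm reading its output correctly):

```go
package main

import (
	"fmt"
	"math"
	"unsafe"
)

func main() {
	var r rune = 'C'
	var i int32 = 67

	// %T shows that a rune variable really is an int32 under the hood.
	fmt.Printf("%T %T\n", r, i) // int32 int32

	// Both occupy four bytes.
	fmt.Println(unsafe.Sizeof(r), unsafe.Sizeof(i)) // 4 4

	// The largest value an int32 can hold.
	fmt.Println(math.MaxInt32) // 2147483647
}
```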
I have a few questions about binary encoding of UTF-8 characters and integers, however.
- It appears that both `rune` and `int32` occupy four bytes. If 2147483647 is the highest integer that can be represented in four bytes (four 8-bit octets), why is its binary representation `1111111111111111111111111111111`, i.e., 31 1's instead of 32? Is there a bit reserved for its sign? It's possible that there's a bug in the binary converter I used, because -2147483648 should still fit in four bytes (it's still representable in the `int32` type), but it is output there as `1111111111111111111111111111111110000000000000000000000000000000`, i.e., 33 1's and 31 0's, which clearly overruns a four-byte allowance. What's the story there? (See the first sketch after this list for how I've been poking at it.)
- In the binary conversion, how would the compiler differentiate between a `rune` like 'C' (`01000011`, according to a Unicode-to-binary table) and the integer 67 (also `01000011`, according to the binary-to-decimal converter I used)? Intuition tells me that some of the bits must be reserved for that information. Which ones? (See the second sketch below.)
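Here's the first sketch I mentioned, getting the binary forms from Go itself instead of an online converter (assuming `fmt`'s `%b` verb is the right tool for this):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	var max int32 = math.MaxInt32
	var min int32 = math.MinInt32

	// %b on a positive value drops leading zeros, so the max shows only 31 ones;
	// the 32nd bit (the sign bit, if I understand two's complement right) is 0
	// and gets dropped like any other leading zero.
	fmt.Printf("%b\n", max) // 1111111111111111111111111111111

	// %b on a negative value prints a minus sign and the magnitude,
	// not a two's-complement bit pattern.
	fmt.Printf("%b\n", min) // -10000000000000000000000000000000

	// Reinterpreting the same 32 bits as unsigned shows the raw
	// two's-complement pattern: a 1 followed by 31 zeros.
	fmt.Printf("%b\n", uint32(min)) // 10000000000000000000000000000000
}
```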
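And here's the second sketch, comparing the rune `'C'` against the plain integer `67`:

```go
package main

import "fmt"

func main() {
	r := 'C' // rune literal; its type is rune (an alias for int32)
	n := 67  // integer literal; its default type is int

	// Same numeric value, same bits with %b, same character with %c;
	// only the static type reported by %T differs.
	fmt.Printf("%T %d %b %c\n", r, r, r, r) // int32 67 1000011 C
	fmt.Printf("%T %d %b %c\n", n, n, n, n) // int 67 1000011 C
}
```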
I've done a fair amount of Googling, but I'm obviously missing the resources that explain this well, so feel free to explain like I'm 5. Please also feel free to correct terminology misuses.