
I get that a rune is an alias for int32 because it's supposed to hold all valid Unicode code points. There are 1,114,112 possible Unicode code points, so it makes sense that they'd have to be stored in four bytes, i.e., an int32-sized value, which can store an integer up to 2,147,483,647.
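As a quick check on those numbers, here's a minimal Go sketch (added for illustration; not part of the original question):

```go
package main

import (
	"fmt"
	"math"
	"unicode"
)

func main() {
	// The Unicode code space runs from 0 through unicode.MaxRune (0x10FFFF),
	// giving 0x110000 = 1,114,112 possible code points.
	fmt.Println(unicode.MaxRune + 1) // 1114112

	// int32 (and rune, its alias) can represent far more than that.
	fmt.Println(math.MaxInt32) // 2147483647
}
```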

I have a few questions about binary encoding of UTF-8 characters and integers, however.

  • It appears that rune and int32 both occupy four bytes. If 2,147,483,647 is the highest integer that can be represented in four bytes (four eight-bit octets), why is its binary representation 1111111111111111111111111111111, i.e., 31 1's instead of 32? Is there a bit reserved for its sign? It's possible that there's a bug in the binary converter I used, because -2,147,483,648 should still be representable in 4 bytes (it still fits in the int32 type), but it is output there as 1111111111111111111111111111111110000000000000000000000000000000, i.e., 33 1's followed by 31 0's, which clearly overruns a four-byte allowance. What's the story there? (The first sketch after this list illustrates it.)
  • In the binary conversion, how would the compiler differentiate between a rune like 'C' (01000011, according to the unicode-to-binary table I consulted) and the integer 67 (also 01000011, according to the binary-to-decimal converter I used)? Intuition tells me that some of the bits must be reserved for that information. Which ones? (The second sketch after this list explores this.)
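To explore the first bullet, here's a minimal Go sketch (added for illustration, not part of the original post). Note that Go's `%b` verb prints a minus sign for negative signed values, so the code converts to unsigned types to expose the raw two's-complement bits:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// The top bit of an int32 is the sign bit, so the largest positive
	// value uses only the low 31 bits: a 0 followed by 31 ones.
	fmt.Printf("%032b\n", uint32(math.MaxInt32))

	n := int32(math.MinInt32) // -2147483648
	// 1 followed by 31 zeros: still exactly 4 bytes.
	fmt.Printf("%032b\n", uint32(n))

	// Widening to int64 first sign-extends: the upper 32 bits become
	// copies of the sign bit, producing the 33 ones / 31 zeros output
	// the converter showed (it was working in 64 bits, not 32).
	fmt.Printf("%064b\n", uint64(int64(n)))
}
```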
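And for the second bullet, a companion sketch (again added for illustration) showing that 'C' and 67 are the same value at run time; only the formatting verb you print them with differs:

```go
package main

import "fmt"

func main() {
	r := 'C' // a rune literal; rune is an alias for int32
	n := 67  // the same numeric value as an integer literal

	fmt.Printf("%c %d %b\n", r, r, r) // C 67 1000011
	fmt.Printf("%c %d %b\n", n, n, n) // C 67 1000011
	fmt.Println(r == rune(n))         // true: no bits distinguish them
}
```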

I've done a fair amount of Googling, but I'm obviously missing the resources that explain this well, so feel free to explain like I'm 5. Please also feel free to correct terminology misuses.

  • Yes, there is indeed a sign bit (but it might not work quite how you think; read on). Your converter is returning an `int64`; [sign extension](https://en.wikipedia.org/wiki/Sign_extension) is why those top bits are 1 not 0. Sign extension is an aspect of [two's complement math](https://en.wikipedia.org/wiki/Two%27s_complement), where `0-1` "wraps around" (as it does for unsigned) and -1 is represented as all 1 bits internally. – twotwotwo Feb 29 '16 at 18:26
  • When your program runs, no bits are reserved to distinguish between the `rune` and the `int32`--the difference is entirely at compile time, when the compiler, e.g., ensures any [assignments](https://golang.org/ref/spec#Assignability) are proper and chooses the right compiled code to output for, say, [`string` iteration or `[]rune` to `string` conversion](http://stackoverflow.com/questions/34861479/how-to-detect-when-bytes-cant-be-converted-to-string-in-go). – twotwotwo Feb 29 '16 at 18:29
  • Note that UTF-8 is a _different_ encoding from what you see in a `rune`. [It's byte-oriented](https://en.wikipedia.org/wiki/UTF-8) and it matches [how the bytes in a `string` exist in memory](http://blog.golang.org/strings). The 32-bit `rune` tends to be a type you use during text processing when, e.g., you need to actually know the code point of a CJK character, not just the sequence of bytes representing it. – twotwotwo Feb 29 '16 at 18:33
  • And am I right in thinking that a CJK (Chinese, Japanese, Korean, for those finding this via Google) character is likely a multibyte character and can be represented by a single `rune`, but not a single `byte`? – nickcoxdotme Feb 29 '16 at 18:36
  • Yep, that's correct. – twotwotwo Feb 29 '16 at 18:37 (see the first sketch after this thread)
  • Then is there something in the bit representation of that `rune` that allows Go to display it as, say, a `您` (binary `110000010101000`, found [here](http://software.ellerton.net/txt2bin/)) and not `24744`, the decimal representation of that binary, if `rune` and `int32` are aliases? – nickcoxdotme Feb 29 '16 at 18:40
  • No, there's nothing in the bit representation of the `rune` in memory at run time. At _compile time_ Go knows `rune`s should have different checks applied to them and special code emitted for them in some cases (like [`[]rune` to `string` conversion](https://golang.org/ref/spec#Conversions_to_and_from_a_string_type)). (And in reality, `fmt.Println('您')` [does output the number](http://play.golang.org/p/3JDXBjzxgj), but I think the point that type can be a purely compile-time difference, unlike in Ruby/JS/Python, bears understanding separately from the specifics here.) – twotwotwo Feb 29 '16 at 18:46 (see the second sketch after this thread)
  • This sentence: "At compile time Go knows runes should have different checks applied to them and special code emitted for them in some cases than int32s." I think _that's_ the crux of my question. What _are_ those checks? How does that work at a low level? Or is that too complicated to explain? – nickcoxdotme Feb 29 '16 at 18:49
  • Yeah, you need to read about how statically vs. dynamically typed languages work under the hood to understand that--once you do, the [spec](https://golang.org/ref/spec) is surprisingly concise and illuminates the Go-specific stuff. Alternately, just playing around in Go or another statically typed language may help to develop your intuition here. The [Tour](http://tour.golang.org/welcome/1), if you haven't done it, is a great place to start. – twotwotwo Feb 29 '16 at 18:52
  • @twotwotwo Any ideas for resources on "how statically vs. dynamically typed languages work under the hood"? With that, and if you put all your comments together as an answer, I'll mark it correct. – nickcoxdotme Mar 01 '16 at 16:24
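A couple of sketches to make the thread above concrete (added for illustration, not part of the original exchange). First, the byte/rune distinction for a CJK character:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "您"

	fmt.Println(len(s))                    // 3: UTF-8 bytes in the string
	fmt.Println(utf8.RuneCountInString(s)) // 1: a single code point
	fmt.Printf("% x\n", s)                 // e6 82 a8: the UTF-8 byte sequence
	fmt.Printf("%U\n", []rune(s)[0])       // U+60A8: the rune's code point
}
```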
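Second, how the same rune value prints as a number by default but as a character when you ask for one:

```go
package main

import "fmt"

func main() {
	r := '您' // at run time, just the int32 value 24744

	fmt.Println(r)         // 24744: default formatting of an integer type
	fmt.Printf("%c\n", r)  // 您: the %c verb interprets it as a code point
	fmt.Println(string(r)) // 您: converting to string encodes it as UTF-8
}
```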

0 Answers