
The type rune in Go is defined as

an alias for int32 and is equivalent to int32 in all ways. It is used, by convention, to distinguish character values from integer values.

If the intention is to use this type to represent character values, why did the authors of the Go language not use uint32 instead of int32? How do they expect a rune value to be handled in a program when it is negative? The other similar type, byte, is an alias for uint8 (and not int8), which seems reasonable.

Rene Knop

5 Answers


I googled and found this

This has been asked several times. rune occupies 4 bytes and not just one because it is supposed to store unicode codepoints and not just ASCII characters. Like array indices, the datatype is signed so that you can easily detect overflows or other errors while doing arithmetic with those types.
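
To illustrate that argument with a small, self-contained sketch (mine, not from the quoted thread): subtracting runes that should not go below zero yields an obviously wrong negative value, while the same subtraction on unsigned types silently wraps around.

package main

import "fmt"

func main() {
	a, b := rune('A'), rune('Z')
	d := a - b // signed: an underflow shows up as a negative number
	if d < 0 {
		fmt.Println("easy to detect:", d) // easy to detect: -25
	}

	ua, ub := uint32('A'), uint32('Z')
	fmt.Println(ua - ub) // 4294967271: silently wrapped around
}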

Sai Ravi Teja K
chendesheng
  • All answers in that thread argue that there is enough space to reference all code points of Unicode in a signed 32-bit integer. Hence, I do understand how rune is big enough to address the Unicode range. The question still remains about the choice of type. Why not uint16, which has a comparable range of positive values and uses only half the space of int32? –  Jul 12 '14 at 16:20
  • @TapanKarecha: uint16 doesn’t fit all of Unicode, though. It fits a really big chunk of it, but Unicode ends at `0x10fffd`. – Ry- Jul 12 '14 at 16:21
  • I see your point @false. I agree that uint16 is not big enough. The question (at the risk of sounding repetitive) about the choice of int32 instead of uint32 still remains. –  Jul 12 '14 at 16:26
  • Yes: `uint` can have hard-to-debug behavior like `a-b > 1000` when `a=1` and `b=2` ([play](http://play.golang.org/p/lsdiZJiN7V)). So Go uses `int` where it can. – twotwotwo Jul 13 '14 at 02:21
  • @twotwotwo I agree with what you state, but I do wonder (and the OP does too) why the same design decision doesn't apply to byte... it could be int8 instead of uint8 and enjoy the same benefits when working with negative values – Victor May 28 '23 at 03:45
  • @Victor For bytes, the (human) convention when talking about byte values is already that 0xFF is 255, not -1, so Go is matching that. Unicode uses no codepoint values over U+10FFFF, well short of the int32 wraparound point, as Ry-'s answer points out below. So making the 32-bit integer signed doesn't risk valid codepoints people think of as positive being displayed as negative. – twotwotwo Jun 13 '23 at 03:13
  • @twotwotwo thanks for giving me some context... but I don't get it, please forgive me. I mean, the `a-b > 1000` problem is actually present when using uint for 'a' and 'b' => https://go.dev/play/p/SKIWAf-dt-c... I still wonder the same thing as before... so probably I haven't understood what you said, except that the value resulting from that subtraction doesn't correspond to any existing Unicode character, right? But you get a plus sign on an arithmetic operation that should be negative, leading to the problem you stated. – Victor Jun 13 '23 at 19:52
  • @Victor It's not that the overflow problem doesn't exist for bytes, it's that we have to tolerate the problem for bytes because using int8 would introduce a different problem. Users think of 0xFF, a valid byte value, as 255, not -1. That's been the case since long before Go existed. Using `int8`, that value prints as `-1` instead of the `255` that everyone expects: https://go.dev/play/p/FqpQXa6JF8I – twotwotwo Jun 14 '23 at 19:15
  • As far as I know, 0xFF is 1111 1111, and that can be 255 or -1 depending on whether it's unsigned or not... probably I'm lacking context or not understanding. In any case, @twotwotwo, thank you for the explanation and the patience – Victor Jun 14 '23 at 19:30
  • @Victor Go chose to treat the byte value 0xFF as 255 because people talk about it as 255. [This extended ASCII table](https://www.rapidtables.com/code/text/ascii-table.html), for example, shows ÿ as character 255, not character -1. – twotwotwo Jun 14 '23 at 23:44
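
To make the byte discussion in the comments above concrete, here is a minimal sketch (mine, not from the thread): the same bit pattern prints as 255 when unsigned and -1 when signed, and the unsigned reading is the one people conventionally use for bytes.

package main

import "fmt"

func main() {
	var b uint8 = 0xFF // byte is an alias for uint8
	var i int8 = -1    // the same bit pattern, interpreted as signed
	fmt.Println(b, i)  // 255 -1: uint8 matches how people talk about byte values
}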

It doesn’t become negative. There are currently 1,114,112 codepoints in Unicode, which is far from 2,147,483,647 (0x7fffffff) – even considering all the reserved blocks.
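
A quick way to check that headroom (an illustrative snippet, not part of the original answer):

package main

import (
	"fmt"
	"math"
	"unicode"
)

func main() {
	fmt.Println(unicode.MaxRune)                 // 1114111 (0x10FFFF), the last code point
	fmt.Println(math.MaxInt32)                   // 2147483647, the top of rune's range
	fmt.Println(unicode.MaxRune < math.MaxInt32) // true: plenty of headroom
}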

Ry-
  • Thanks! Though a rune may address a range much larger than Unicode needs at this time, the question is about the fact that a negative value *can* be assigned to a rune. This could have been avoided if it were an unsigned integer. But there may be other considerations that make it sensible for a rune to still be a signed type, and I wonder what those are. –  Jul 12 '14 at 16:10
  • @TapanKarecha: Sure, but you could also assign a positive value outside of Unicode’s range. Neither one would be valid Unicode. (Negative numbers might be more obvious to check for as an error condition, as a habit taken from C?) – Ry- Jul 12 '14 at 16:23
  • @false: Yes, there will be invalid values on the positive end of the type range, but having invalid values on both ends of the type range is something I am having trouble dealing with as a concept. As you said, if the type were unsigned, I wouldn't have to worry about checking for the negative value, which is one less check during validation. –  Jul 12 '14 at 16:32
  • @TapanKarecha: No, I was saying that a negative return value on something that ought to return Unicode would be an obvious error (not something that Go needs, but something that you might commonly do in other languages), but checking the positive isn’t convenient at all. Judging by [Unicode’s stability policy](http://unicode.org/policies/stability_policy.html), it might not even be possible. – Ry- Jul 12 '14 at 16:35
  • I think chendesheng's quote gets at the root cause best: Go uses a lot of signed values, not just for runes but array indices, `Read`/`Write` byte counts, etc. That's because `uint`s, in any language, behave confusingly unless you guard every piece of arithmetic against overflow (for example if `var a, b uint = 1, 2`, `a-b > 0` and `a-b > 1000000`: http://play.golang.org/p/lsdiZJiN7V). `int`s behave more like numbers in everyday life, which is a compelling reason to use them, and there is no equally compelling reason not to use them. – twotwotwo Jul 13 '14 at 02:03

"Golang, Go : what is rune by the way?" mentioned:

With the recent Unicode 6.3, there are over 110,000 symbols defined. This requires at least 21-bit representation of each code point, so a rune is like int32 and has plenty of bits.

But regarding the overflow and negative-value issues, note that the implementation of some of the unicode functions, like unicode.IsGraphic, does include:

We convert to uint32 to avoid the extra test for negative

Code:

const MaxLatin1 = '\u00FF' // maximum Latin-1 value.

// IsGraphic reports whether the rune is defined as a Graphic by Unicode.
// Such characters include letters, marks, numbers, punctuation, symbols, and
// spaces, from categories L, M, N, P, S, Zs.
func IsGraphic(r rune) bool {
    // We convert to uint32 to avoid the extra test for negative,
    // and in the index we convert to uint8 to avoid the range check.
    if uint32(r) <= MaxLatin1 {
        return properties[uint8(r)]&pg != 0
    }
    return In(r, GraphicRanges...)
}
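
The conversion works because a negative rune becomes a very large uint32, so one unsigned comparison rejects negative and out-of-range values alike. A tiny demonstration (my own sketch, not stdlib code):

package main

import "fmt"

func main() {
	const MaxLatin1 = '\u00FF'
	r := rune(-1)
	fmt.Println(uint32(r))              // 4294967295: far above MaxLatin1
	fmt.Println(uint32(r) <= MaxLatin1) // false, with no separate r < 0 test
}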

That may be because a rune is supposed to be a constant (as mentioned in "Go rune type explanation"): an untyped rune constant can be stored in an int32, a uint32, or even a float32, because its constant value allows it to be stored in any numeric type that can represent it.
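
For instance, the same untyped rune constant can be assigned to several numeric types (a small sketch of the point above):

package main

import "fmt"

func main() {
	const c = 'A' // an untyped rune constant
	var i int32 = c
	var u uint32 = c
	var f float32 = c
	fmt.Println(i, u, f) // 65 65 65
}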

Telemachus
VonC

The fact that a rune is allowed to hold a negative value lets you define your own rune sentinel values.

For example:

import "unicode/utf8"

const EOF rune = -1 // not a valid Unicode code point, so it cannot collide with decoded input

// Just the lexer fields that next() needs; the talk's lexer has more.
type lexer struct {
	input string // the string being scanned
	pos   int    // current position in the input
	width int    // width of the last rune read
}

func (l *lexer) next() (r rune) {
	if l.pos >= len(l.input) {
		l.width = 0
		return EOF
	}
	r, l.width = utf8.DecodeRuneInString(l.input[l.pos:])
	l.pos += l.width
	return r
}

Seen here in a talk by Rob Pike: Lexical Scanning in Go.
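
Because EOF is negative, a caller can compare the returned rune against it directly; with an unsigned rune, an out-of-band signal would be needed. A hypothetical driver loop (mine, not from the talk; it also needs the "fmt" import):

func main() {
	l := &lexer{input: "héllo"}
	for r := l.next(); r != EOF; r = l.next() {
		fmt.Printf("%q ", r) // 'h' 'é' 'l' 'l' 'o'
	}
}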

Dave

In addition to the answers above, here are my two cents on why Go needs rune.

  • Strings in Go are byte slices, and ASCII characters are represented as a single byte each, which gives Go a performance advantage over languages that store strings as arrays of wider characters.
  • But since we also need a way to represent Unicode code points that cannot fit in an 8-bit range, we use rune for those (see the sketch below).
  • Why int32 and not uint32, you may ask? This was done deliberately so that overflows are easier to detect while doing arithmetic on code points.

This article talks about all of this in much more detail.
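
A minimal sketch of the byte-versus-rune distinction described above (illustrative only):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "héllo"
	fmt.Println(len(s))                    // 6 bytes: 'é' needs two bytes in UTF-8
	fmt.Println(utf8.RuneCountInString(s)) // 5 runes (code points)
	for i, r := range s {                  // ranging over a string yields runes
		fmt.Printf("%d:%q ", i, r)         // 0:'h' 1:'é' 3:'l' 4:'l' 5:'o'
	}
}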

Sai Ravi Teja K