320

What is a rune in Go?

I've been googling but Golang only says in one line: rune is an alias for int32.

But how come integers are used all around like swapping cases?

The following is a function swapcase. What is all the <= and -?

And why doesn't switch have any arguments?

&& should mean and but what is r <= 'z'?

func SwapRune(r rune) rune {
    switch {
    case 'a' <= r && r <= 'z':
        return r - 'a' + 'A'
    case 'A' <= r && r <= 'Z':
        return r - 'A' + 'a'
    default:
        return r
    }
}

Most of them are from http://play.golang.org/p/H6wjLZj6lW

func SwapCase(str string) string {
    return strings.Map(SwapRune, str)
}

I understand this is mapping rune to string so that it can return the swapped string. But I do not understand how exactly rune or byte works here.

hippietrail
  • Sidenote: This doesn't do what younger readers might want it to do for the [English word "café"](https://en.oxforddictionaries.com/definition/caf%C3%A9) and [others](https://en.wikipedia.org/wiki/English_terms_with_diacritical_marks) - let alone other languages. Go has libraries with decent support for actually useful variants of this kind of transformation. – RedGrittyBrick Aug 24 '18 at 16:06
  • 13
    In case anyone wants to know where the word "rune" came from: https://en.wikipedia.org/wiki/Runic_(Unicode_block) – Matt Browne Sep 20 '18 at 18:51
  • A `[]rune` can be set to a boolean, numeric, or string type. See https://stackoverflow.com/a/62739051/12817546. –  Jul 09 '20 at 07:51

10 Answers

230

Rune literals are just 32-bit integer values (however, they're untyped constants, so their type can change). They represent Unicode code points. For example, the rune literal 'a' is actually the number 97.

Therefore your program is pretty much equivalent to:

package main

import "fmt"

func SwapRune(r rune) rune {
    switch {
    case 97 <= r && r <= 122:
        return r - 32
    case 65 <= r && r <= 90:
        return r + 32
    default:
        return r
    }
}

func main() {
    fmt.Println(SwapRune('a'))
}

This should be obvious if you look at the Unicode table, which is identical to ASCII in that range. Furthermore, 32 is the offset between the uppercase and lowercase code point of each letter, so adding 32 to 'A' gives 'a', and subtracting 32 from 'a' gives 'A'.
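As a small sanity check of the numbers above, here is a sketch that prints the values behind a few rune literals. The helper name toUpperASCII is made up for this illustration and only handles a-z.

```go
package main

import "fmt"

// toUpperASCII is a hypothetical helper for this sketch: it assumes r is
// in the range 'a'..'z' and subtracts the fixed ASCII offset between cases.
func toUpperASCII(r rune) rune {
	return r - ('a' - 'A')
}

func main() {
	fmt.Println('a')       // 97: a rune literal is just a number
	fmt.Println('a' - 'A') // 32: the offset between lower and upper case
	fmt.Printf("%c %d %U\n", 'a', 'a', 'a') // a 97 U+0061

	// Println on a rune prints the number; use %c or string() to see the character.
	fmt.Println(toUpperASCII('q'))         // 81
	fmt.Println(string(toUpperASCII('q'))) // Q
}
```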

Inanc Gumus
topskip
  • 20
    This obviously works only for ASCII characters and not for accented characters such as 'ä', let alone more complicated cases like the 'ı' (U+0131). Go has special functions to map to lower case such as `unicode.ToLower(r rune) rune`. – topskip Oct 11 '13 at 06:06
  • 3
    And to add to @topskip's correct answer with a SwapCase function that works for all codepoints and not just a-z: `func SwapRune(r rune) rune { if unicode.IsUpper(r) { r = unicode.ToLower(r) } else { r = unicode.ToUpper(r) }; return r }` – ANisus Oct 11 '13 at 06:33
  • 29
    Runes are int32 values. That's the entire answer. They're not _"mapped"_. – thwd Oct 11 '13 at 11:38
  • @AlixAxel : The behavior of SimpleFold is essentially the same (It also uses ToLower and ToUpper for most runes). There are some cases where it differs such as: DZ->Dz, Dz->dz, dz->DZ. My SwapRune would instead go: DZ->dz, Dz->DZ, dz->DZ. I like your suggestion better :) – ANisus Feb 10 '14 at 07:36
  • 7
    So runes are similar to C chars? – Kenny Worden Feb 23 '17 at 16:54
  • 1
    @KennyWorden Runes are 32-bit which means one rune can hold any unicode character. However, c chars I believe are typically only 8-bit which means one char can only represent a character in the extended-ascii range – David Callanan Jul 30 '19 at 09:28
  • Clarifying answer. I quoted you here https://stackoverflow.com/a/62739051/12817546. –  Jul 07 '20 at 08:58
  • https://play.golang.org/p/HhZ9FM1ksPv this can also help – Alok Kumar Singh Sep 01 '21 at 07:02
90

From the Go lang release notes: http://golang.org/doc/go1#rune

Rune is a type. It occupies 32 bits and is meant to represent a Unicode code point. As an analogy, the English character set encoded in ASCII has 128 code points, and thus fits inside a byte (8 bits). From this (erroneous) assumption, C treated characters as bytes (char) and strings as a sequence of characters (char*).

But guess what: there are many other symbols invented by humans besides the 'abcde..' symbols. There are so many that a single byte cannot hold them, so Go uses a 32-bit value for each one.

In Go, then, a string is a sequence of bytes. However, since multiple bytes can encode one rune (code point), a string value can also contain runes. So it can be converted to a []rune, and vice versa.
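To make the bytes-versus-runes distinction concrete, here is a minimal sketch using a for-range loop, which decodes a string one rune at a time and reports the byte offset where each rune starts. The helper runeOffsets is a name invented for this example.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte offset at which
// each rune of s starts. Ranging over a string decodes UTF-8 for us.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	// "a" takes 1 byte, U+00E9 (e with acute) takes 2, U+20AC (euro sign) takes 3.
	s := "a\u00e9\u20ac"

	fmt.Println(len(s))         // 6 (bytes)
	fmt.Println(len([]rune(s))) // 3 (runes)

	for i, r := range s {
		fmt.Printf("byte offset %d: %c (%U)\n", i, r, r)
	}

	fmt.Println(runeOffsets(s))         // [0 1 3]
	fmt.Println(string([]rune(s)) == s) // true: the conversion round-trips
}
```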

The unicode package http://golang.org/pkg/unicode/ can give a taste of the richness of the challenge.

Inanc Gumus
fabrizioM
  • 6
    With the recent Unicode 6.3, there are over 110,000 symbols defined. This requires at least 21-bit representation of each code point, so a `rune` is like `int32` and has plenty of bits. – Rick-777 Oct 12 '13 at 12:08
  • 2
    You say "a `string` is a sequence of `rune`s" - I don't think that's true? [Go blog](https://blog.golang.org/strings): "a string is just a bunch of bytes"; [Go lang spec](https://golang.org/ref/spec#String_types): "A string value is a (possibly empty) sequence of bytes" – Chris Martin May 16 '16 at 22:52
  • 1
    I'm still confused, so is string an array of runes or an array of bytes? Are they interchangeable? – gogofan Aug 11 '17 at 09:30
  • strings in go are made up of runes and not bytes, refer this https://mymemorysucks.wordpress.com/2017/05/03/a-short-guide-to-mastering-strings-in-golang/ – prvn Mar 23 '18 at 07:23
  • 1
    @prvn That's wrong. It's like saying an image is not a sequence of bytes, it's a sequence of pixels. But, actually, underneath, it's a series of bytes. **A string is a series of bytes, not runes.** Please read the [spec](https://golang.org/ref/spec#String_types). – Inanc Gumus Aug 26 '18 at 15:30
  • @InancGumus, yes, but when you say image is a sequence a pixel, it is distinctive and gives more clarity. So when we say strings are made up of runes, users can visualize clearly that in order to run the entire length of string they need to take steps in 'runes' and not 'bytes'. – prvn Aug 27 '18 at 08:16
  • 2
    @prvn But, you can't say `not bytes`. Then, you might say: "Strings are made up of runes and runes made up of bytes" Something like that. Then again. it's not completely true. – Inanc Gumus Aug 27 '18 at 09:20
  • 1
    Persuasive answer. I quoted you here https://stackoverflow.com/a/62739051/12817546. –  Jul 07 '20 at 08:56
  • 1
    @prvn Strings are arrays of bytes. This means that if you index into a string using `[n]`, you get the nth _byte_, not the nth _rune_. You have to use one of the Go standard library's many Unicode-encoding-aware functions to scan the string if you want to find the Nth rune. – Mark Reed Nov 16 '21 at 15:39
67

I have tried to keep my language simple so that a layman understands rune.

A rune is a character. That's it.

It is a single character. It's a character from any alphabet from any language from anywhere in the world.

To get a string we use

double-quotes ""

OR

back-ticks ``

A string is different than a rune. In runes we use

single-quotes ''
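Here is a short sketch of the three kinds of quotes, printing the type Go gives each literal. The function name literalTypes is made up for this example.

```go
package main

import "fmt"

// literalTypes (hypothetical helper) reports the default types of a
// rune literal and a string literal.
func literalTypes() string {
	return fmt.Sprintf("%T %T", 'x', "x")
}

func main() {
	r := 'x'     // single quotes: a rune, which prints as type int32
	s := "x\n"   // double quotes: escapes are interpreted, so 2 bytes
	raw := `x\n` // back-ticks: raw string, the backslash is kept, so 3 bytes

	fmt.Println(literalTypes())      // int32 string
	fmt.Println(r, len(s), len(raw)) // 120 2 3
}
```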

Now a rune is also an alias for int32...Uh What?

The reason rune is an alias for int32 is that character coding schemes map each character to some number, and it is that number we store. For example, a maps to 97, and when we store it, it is just the number; that's why rune is an alias for int32. But it is not just any number: it is a number with 32 'zeros and ones', or 4 bytes. (Note: UTF-8 itself is a variable-width encoding that uses 1 to 4 bytes per code point.)

How do runes relate to strings?

A string can be viewed as a collection of runes, stored as UTF-8 bytes. In the following code:

    package main

    import (
        "fmt"
    )

    func main() {
        fmt.Println([]byte("Hello"))
    }

We convert the string to a slice of bytes. The output is:

[72 101 108 108 111]

We can see that, for this ASCII-only string, each byte corresponds to exactly one rune.

Suhail Gupta
  • 14
    `A string is not a collection of runes` this is not correct strictly speaking. Instead, string is a byte slice, encoded with utf8. Each char in string actually takes 1 ~ 3 bytes, while each rune takes 4 bytes. You can convert between string and []rune, but they are different. – Eric Jul 31 '18 at 10:08
  • 7
    Rune is not a character, a rune represents a unicode codepoint. And a codepoint doesn't necessarily point to one character. – Inanc Gumus Oct 10 '18 at 14:18
  • Worth to add that "a rune is also an alias for int32" yes, but it doesn't mean it's useful for poor-man compression... If you hit something like 55296 the string conversion goes astray: [Go Playground](https://play.golang.org/p/XFVKayUhV27) – kubanczyk Nov 24 '19 at 00:55
  • Note: UTF-8 is _not_ a 4-byte encoding scheme; I believe you're thinking about Unicode code points (which are 32 bits). The beauty of UTF-8 is that _each character takes as few bytes as needed_, or, in other words, each character has a _variable_ size. Characters up to 127 (i.e. ASCII) are just encoded in a single byte. All characters on the old ANSI code pages will take 2 bytes. And so forth — up to 6 bytes (for some complex emojis with variants, for instance). That means that "Hello" just takes 5 bytes, in ASCII _and_ UTF-8. – Gwyneth Llewelyn Jul 10 '23 at 21:38
39

(Got a feeling that the above answers still didn't state the differences & relationships between string and []rune very clearly, so I would try to add another answer with an example.)

As @Strangework's answer said, string and []rune are quite different.

Differences - string & []rune:

  • A string value is a read-only byte slice, and a string literal is encoded in UTF-8. Each char in a string takes 1 to 4 bytes, while each rune takes 4 bytes.
  • For string, both len() and index are based on bytes.
  • For []rune, both len() and index are based on rune (or int32).

Relationships - string & []rune:

  • When you convert from string to []rune, each utf-8 char in that string becomes a rune.
  • Similarly, in the reverse conversion, when converting from []rune to string, each rune becomes a utf-8 char in the string.

Tips:

  • You can convert between string and []rune, but still they are different, in both type & overall size.

(I would add an example to show that more clearly.)


Code

string_rune_compare.go:

// string & rune compare,
package main

import "fmt"

// string & rune compare,
func stringAndRuneCompare() {
    // string,
    s := "hello你好"

    fmt.Printf("%s, type: %T, len: %d\n", s, s, len(s))
    fmt.Printf("s[%d]: %v, type: %T\n", 0, s[0], s[0])
    li := len(s) - 1 // last index,
    fmt.Printf("s[%d]: %v, type: %T\n\n", li, s[li], s[li])

    // []rune
    rs := []rune(s)
    fmt.Printf("%v, type: %T, len: %d\n", rs, rs, len(rs))
}

func main() {
    stringAndRuneCompare()
}

Execute:

go run string_rune_compare.go

Output:

hello你好, type: string, len: 11
s[0]: 104, type: uint8
s[10]: 189, type: uint8

[104 101 108 108 111 20320 22909], type: []int32, len: 7

Explanation:

  • The string hello你好 has length 11, because the first 5 chars each take only 1 byte, while the last 2 Chinese chars each take 3 bytes.

    • Thus, total bytes = 5 * 1 + 2 * 3 = 11
    • Since len() on a string is based on bytes, the first line prints len: 11
    • Since indexing a string is also based on bytes, the following 2 lines print values of type uint8 (byte is an alias of uint8 in Go).
  • When converting the string to []rune, Go finds 7 UTF-8 chars, thus 7 runes.

    • Since len() on []rune is based on runes, the last line prints len: 7.
    • If you index into a []rune, access is based on runes.
      Since each rune comes from a UTF-8 char in the original string, you can also say that both len() and indexing on []rune are based on UTF-8 chars.
involtus
Eric
  • "For string, both len() and index are based on bytes." Could you explain that a little more? When I do `fmt.Println("hello你好"[0])` it returns the actual UTF-8 code point instead of bytes. – Julian Oct 13 '18 at 11:32
  • @Julian Please take a look at the output of the program in the answer, for `s[0]`, it print `s[0]: 104, type: uint8`, the type is `uint8`, means its a byte. For ASCII chars like `h` utf-8 also use a single byte to represent it, so the code point is the same as the single byte; but for chinese chars like `你`, it use 3 bytes. – Eric Oct 13 '18 at 18:06
  • Clarifying example. I quoted you here https://stackoverflow.com/a/62739051/12817546. –  Jul 07 '20 at 08:49
38

I do not have enough reputation to post a comment to fabrizioM's answer, so I will have to post it here instead.

Fabrizio's answer is largely correct, and he certainly captured the essence of the problem - though there is a distinction which must be made.

A string is NOT necessarily a sequence of runes. It is a wrapper over a 'slice of bytes', a slice being a wrapper over a Go array. What difference does this make?

A rune type is necessarily a 32-bit value, meaning a sequence of values of rune types would necessarily have some number of bits x*32. Strings, being a sequence of bytes, instead have a length of x*8 bits. If all strings were actually in Unicode, this difference would have no impact. Since strings are slices of bytes, however, Go can use ASCII or any other arbitrary byte encoding.

String literals, however, are required to be written into the source encoded in UTF-8.
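To illustrate that a string can hold arbitrary bytes, here is a sketch that puts a Latin-1 encoded byte into a string with an escape and compares it to the UTF-8 form. The helper name inspect is an assumption of this example; utf8.ValidString is from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// inspect (hypothetical helper) reports the byte length of s, whether its
// bytes are valid UTF-8, and the runes produced by converting it.
func inspect(s string) (int, bool, []rune) {
	return len(s), utf8.ValidString(s), []rune(s)
}

func main() {
	// \xe9 is a single raw byte (e with acute in Latin-1), not valid UTF-8.
	fmt.Println(inspect("caf\xe9")) // 4 false [99 97 102 65533]

	// \u00e9 is the code point U+00E9, stored as 2 UTF-8 bytes.
	fmt.Println(inspect("caf\u00e9")) // 5 true [99 97 102 233]
}
```

The invalid byte becomes 65533 (U+FFFD, the Unicode replacement character) when converted to a rune, so the byte-level and rune-level views really can disagree.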

Source of information: http://blog.golang.org/strings

informatik01
Strangework
  • 2
    Good point! Each rune requires 4 bytes, but each character in string is encoded with utf8, thus only 1 ~ 3 bytes at most. – Eric Jul 31 '18 at 08:39
  • Well, 1 ~ 6, to be more precise (think about complex emoji variants). In practice, it's reasonable to assume that European languages other than English will take a bit more than 1 (since accented characters will require 2 bytes), while non-European languages using non-Latin alphabets/ideograms will require many more bytes. But it's unlikely that even Klingon with lots of emojis will take 6 bytes for _every_ code point :-) – Gwyneth Llewelyn Jul 10 '23 at 21:41
9

Everyone else has covered the part related to runes, so I am not going to talk about that.

However, there is also a question related to switch not having any arguments. This is simply because in Golang, switch without an expression is an alternate way to express if/else logic. For example, writing this:

t := time.Now()
switch {
case t.Hour() < 12:
    fmt.Println("It's before noon")
default:
    fmt.Println("It's after noon")
}

is the same as writing this:

t := time.Now()
if t.Hour() < 12 {
    fmt.Println("It's before noon")
} else {
    fmt.Println("It's after noon")
}

You can read more here.
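With more than two branches, the expression-less switch reads like a chain of if / else if / else. A minimal sketch, where partOfDay is a made-up function for illustration:

```go
package main

import "fmt"

// partOfDay (hypothetical helper) picks the first case whose condition is
// true, exactly like an if / else if / else chain would.
func partOfDay(hour int) string {
	switch {
	case hour < 12:
		return "morning"
	case hour < 18:
		return "afternoon"
	default:
		return "evening"
	}
}

func main() {
	for _, h := range []int{9, 15, 21} {
		fmt.Println(h, partOfDay(h))
	}
}
```

Cases are evaluated top to bottom and only the first match runs, so the order of the conditions matters.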

Shashank Goyal
  • Ha! Thanks. I keep forgetting that and always write things like `switch true { ... }` which is rather stupid. You could also add that the `switch` keyword, in Go, is meant to be something allowing multiple, chained if-else constructs, each with its own condition to satisfy, which, however, are visually much more compact to represent with Go's `switch`... – Gwyneth Llewelyn Jul 10 '23 at 21:45
3

A rune is an int32 value; it is the Go type used for representing a Unicode code point. A Unicode code point, or code position, is a numerical value that usually represents a single Unicode character.

Remario
2

Program

package main

import (
    "fmt"
)

func main() {
    words := "€25 or less"
    fmt.Println("as string slice")
    fmt.Println(words, len(words))

    runes := []rune(words)
    fmt.Println("\nas []rune slice")
    fmt.Printf("%v, len:%d\n", runes, len(runes))

    bytes := []byte(words)
    fmt.Println("\nas []byte slice")
    fmt.Printf("%v, len:%d\n", bytes, len(bytes))
}

Output

as string slice
€25 or less 13

as []rune slice
[8364 50 53 32 111 114 32 108 101 115 115], len:11

as []byte slice
[226 130 172 50 53 32 111 114 32 108 101 115 115], len:13

As you can see, the euro symbol '€' is represented by 3 bytes: 226, 130 and 172. A rune represents a character, any character, be it hieroglyphics. The 32 bits of a rune are sufficient to represent all the characters in the world as of today. Hence, the rune representation of the euro symbol '€' is 8364.

For ASCII characters, of which there are 128, a byte (8 bits) is sufficient. Hence, the rune and byte representations of digits or letters are the same. E.g., 2 is represented by 50.

The byte representation of a string is always greater than or equal to its rune representation in length, since certain characters need more than one byte but still fit within the 32 bits of a rune.
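The step from the rune 8364 to the bytes 226, 130, 172 can be reproduced with the standard unicode/utf8 package. This is only a sketch; the helper name encode is invented for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode (hypothetical helper) returns the UTF-8 bytes for a single rune.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // UTFMax is 4, the longest encoding
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	fmt.Println(encode('€')) // [226 130 172]
	fmt.Println(encode('2')) // [50]

	// Decoding goes the other way: bytes in, rune and byte width out.
	r, size := utf8.DecodeRune([]byte{226, 130, 172})
	fmt.Println(r, size) // 8364 3
}
```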

https://play.golang.org/p/y93woDLs4Qe

dpaks
1

rune is an alias for int32 and is equivalent to int32 in all ways. It is used to distinguish character values from integer values.

l = 108, o = 111
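Because rune is an alias rather than a new type, a rune can be assigned to an int32 variable without any conversion. A small sketch, where describe is a hypothetical helper used only to show the number and character forms side by side:

```go
package main

import "fmt"

// describe (hypothetical helper) formats a rune as its number and its character.
func describe(r rune) string {
	return fmt.Sprintf("%d %c", r, r)
}

func main() {
	var r rune = 'l'
	var i int32 = r // compiles without a conversion: rune and int32 are the same type

	fmt.Printf("%T %T\n", r, i) // int32 int32
	fmt.Println(describe('l'))  // 108 l
	fmt.Println(describe('o'))  // 111 o
}
```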

Clement Olaniyan
1

Rune is an alias for the int32 type. It represents a single Unicode code point. The Unicode Consortium assigns numeric values, called code points, to characters; the code space has room for over one million of them. For example, 65 is the code point for the letter A, 66 -> B (source: Get Programming with Go)

  • Here is a link to the go source code where rune is defined: https://cs.opensource.google/go/go/+/master:src/builtin/builtin.go;l=92?q=type%20rune&ss=go%2Fgo – JessG Aug 06 '22 at 20:03