
How can I get the number of characters of a string in Go?

For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

Jonathan Hall
Ammar

7 Answers


You can try RuneCountInString from the utf8 package.

returns the number of runes in p

As illustrated in this script: the byte length of "World" written in Chinese ("世界") is 6, but the rune count of "世界" is 2:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}

Phrozen adds in the comments:

Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.


And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)

The compiler now detects the len([]rune(string)) pattern automatically and replaces it with a rune-counting loop equivalent to for r := range s.

Adds a new runtime function to count runes in a string. Modifies the compiler to detect the pattern len([]rune(string)) and replaces it with the new rune counting runtime function.

RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese      126ns ± 2%    60ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength   104ns ± 2%    50ns ± 1%  -51.71%
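For readers who want to see the equivalence the compiler exploits, here is a small standard-library sketch (an addition, not from the original answer) comparing the three ways of counting runes:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "Hello, 世界"

    // Since Go 1.11 the compiler rewrites this pattern into a call to a
    // rune-counting runtime function, avoiding the []rune allocation.
    n1 := len([]rune(s))

    // The equivalent explicit loop: ranging over a string decodes one
    // rune per iteration.
    n2 := 0
    for range s {
        n2++
    }

    // The library function that has always done this directly.
    n3 := utf8.RuneCountInString(s)

    fmt.Println(n1, n2, n3) // 9 9 9
}
```

All three produce the same count; since Go 1.11 they also perform comparably.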

Stefan Steiger points to the blog post "Text normalization in Go"

What is a character?

As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute accent, "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.

The definition of a character may vary depending on the application.
For normalization we will define it as:

  • a sequence of runes that starts with a starter, a rune that does not modify or combine backwards with any other rune,
  • followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).

The normalization algorithm processes one character at a time.

Using that package and its Iter type, the actual number of "characters" would be:

package main

import (
    "fmt"

    "golang.org/x/text/unicode/norm"
)

func main() {
    var ia norm.Iter
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}

Here, this uses the Unicode normalization form NFKD ("Compatibility Decomposition").


Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.

For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.

That will actually count "grapheme cluster", where multiple code points may be combined into one user-perceived character.

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    // 👍🏼 (thumbs up + skin-tone modifier), followed by "!"
    gr := uniseg.NewGraphemes("\U0001F44D\U0001F3FC!")
    for gr.Next() {
        fmt.Printf("%x ", gr.Runes())
    }
    // Output: [1f44d 1f3fc] [21]
}

Two graphemes, even though there are three runes (Unicode code points).

You can see other examples in "How to manipulate strings in GO to reverse them?"

A single emoji ZWJ sequence is one grapheme but, as a Unicode code point converter will show, can consist of 4 runes.
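The rune/grapheme gap is easy to observe with the standard library alone; the strings below are illustrative examples (an addition, not the ones from the original answer):

```go
package main

import "fmt"

func main() {
    // "é" as base letter + combining acute accent (NFD form): one
    // user-perceived character, but two runes and three bytes.
    s := "e\u0301"
    fmt.Println(len(s), len([]rune(s))) // 3 2

    // An emoji ZWJ sequence (woman + zero-width joiner + rocket) that
    // renders as a single glyph but spans three code points.
    zwj := "\U0001F469\u200D\U0001F680"
    for _, r := range zwj {
        fmt.Printf("%U ", r) // U+1F469 U+200D U+1F680
    }
    fmt.Println(len([]rune(zwj))) // 3
}
```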

bit
VonC
    You can see it in action in this string reversion function at http://stackoverflow.com/a/1758098/6309 – VonC Oct 01 '12 at 07:11
    This only tells you the number of runes, not the number of glyphs. Many glyphs are made of multiple runes. – Stephen Weinberg Oct 01 '12 at 18:22
  • Actually you can do len() over runes by just type casting... len([]rune("世界")) will print 2. At least in Go 1.3, dunno how long it has been there. – Phrozen Aug 28 '14 at 00:32
  • @VonC: Actually, a character (colloquial language term for glyph) can - occasionally - span several runes, so this answer is, to use the precise technical term, WRONG. What you need is the grapheme/grapheme-cluster count, not the rune count. For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). But a human would (correctly) regard é as ONE character. Apparently it makes a difference in Telugu. But probably also French, depending on the keyboard/locale you use. https://blog.golang.org/normalization – Stefan Steiger Jan 27 '16 at 08:03
  • @StefanSteiger Thank you for this comment. I have edited the answer to integrate it as best as I could. Feel free to edit the answer and improve it. – VonC Jan 27 '16 at 08:21
  • I tried this answer and got the wrong result for emojis with different skin color (it counts those emoji's as two characters). – Bjorn Roche Apr 29 '16 at 01:44
  • This answer covers some cases, but not all. See Oliver's answer below. – Justin Johnson Apr 22 '19 at 21:14
  • Great explanation! And just out of curiosity, am I right to interpret string in #golang this way? The string type is a special form of byte slice. It cannot be modified (immutable). And it can distinguish runes (or Unicode code points) by the rules of UTF-8, so when 'range' is applied to it, the string knows how to retrieve one rune at a time. –  Oct 24 '20 at 07:30
    @juancortez As explained in https://blog.golang.org/strings, a string is just a slice of bytes: it holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. Nothing "special". https://golang.org/pkg/unicode/utf8/ allows you to interpret a string literal as a collection of runes. Which is not enough to reliably determine a character. Hence the need for a Unicode Text Segmentation third-party library, to reliably determine the actual graphemes/glyphs in a string. – VonC Oct 24 '20 at 08:32
  • @VonC Got it! Very articulate! –  Oct 26 '20 at 06:40

There is a way to get the count of runes without importing any packages: convert the string to a rune slice and take its length, as in len([]rune(YOUR_STRING)):

package main

import "fmt"

func main() {
    russian := "Спутник и погром"
    english := "Sputnik & pogrom"

    fmt.Println("count of bytes:",
        len(russian),
        len(english))

    fmt.Println("count of runes:",
        len([]rune(russian)),
        len([]rune(english)))

}

count of bytes: 30 16

count of runes: 16 16
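The gap between the two counts comes from UTF-8 encoding width. A small sketch (an addition, using only the standard library) makes the per-rune widths visible:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    russian := "Спутник и погром"

    // Each Cyrillic letter takes two bytes in UTF-8, while each ASCII
    // space takes one: 14 letters * 2 bytes + 2 spaces = 30 bytes for
    // only 16 runes.
    for i, r := range russian {
        fmt.Printf("%c starts at byte %d and is %d byte(s) wide\n",
            r, i, utf8.RuneLen(r))
    }
    fmt.Println(len(russian), utf8.RuneCountInString(russian)) // 30 16
}
```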

Denis Kreshikhin
  • This answer has a lot of votes, but it is wrong. Accented characters can consume more than one rune: https://play.golang.com/p/pSmuMMZN9g_t – dimalinux Aug 14 '23 at 04:14
  • @dimalinux There is nothing about characters in my answer; my answer is for people who want to get the number of runes. However, the Unicode standard defines an accent as an "abstract character", so a glyph with an accent actually represents two Unicode characters, which equal two runes in Go. – Denis Kreshikhin Aug 15 '23 at 21:33
  • @dimalinux In other words, nothing is wrong in my answer. It just requires understanding the difference between runes, characters, and glyphs for correct use. – Denis Kreshikhin Aug 15 '23 at 21:36

I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:

fmt.Println(utf8.RuneCountInString("🇩🇪🏳️‍🌈")) // Outputs "6".
fmt.Println(len([]rune("🇩🇪🏳️‍🌈"))) // Outputs "6".

That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.

Same for using the Normalization package:

var ia norm.Iter
ia.InitString(norm.NFKD, "🇩🇪🏳️‍🌈")
nc := 0
for !ia.Done() {
    nc = nc + 1
    ia.Next()
}
fmt.Println(nc) // Outputs "6".

Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
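To illustrate (this example is an addition, standard library only): two strings that render identically can have different rune counts, which is why neither rune counting nor normalization equals counting user-perceived characters:

```go
package main

import "fmt"

func main() {
    precomposed := "\u00e9" // "é" as a single code point (NFC form)
    decomposed := "e\u0301" // "é" as 'e' + combining acute (NFD form)

    // They render the same but compare unequal byte-for-byte...
    fmt.Println(precomposed == decomposed) // false

    // ...and their rune counts differ.
    fmt.Println(len([]rune(precomposed)), len([]rune(decomposed))) // 1 2
}
```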

masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):

fmt.Println(GraphemeCountInString("🇩🇪🏳️‍🌈"))  // Outputs "5".
fmt.Println(GraphemeCountInString2("🇩🇪🏳️‍🌈")) // Outputs "5".

The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:

fmt.Println(uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")) // Outputs "2".
Oliver

If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check whether the string conforms to the stream-safe text format.

package main

import (
    "regexp"
    "strings"
    "unicode"
)

func main() {

    str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
    str2 := "a" + strings.Repeat("\u0308", 1000)

    println(4 == GraphemeCountInString(str))
    println(4 == GraphemeCountInString2(str))

    println(1 == GraphemeCountInString(str2))
    println(1 == GraphemeCountInString2(str2))

    println(true == IsStreamSafeString(str))
    println(false == IsStreamSafeString(str2))
}


func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}

func GraphemeCountInString2(str string) int {
    length := 0
    checked := false

    for _, c := range str {
        if !unicode.Is(unicode.M, c) {
            // A non-combining rune starts a new grapheme.
            length++
            checked = true
        } else if !checked {
            // Combining marks before any base character count on their own.
            length++
        }
    }

    return length
}

func IsStreamSafeString(str string) bool {
    re := regexp.MustCompile("\\PM\\pM{30,}") 
    return !re.MatchString(str) 
}
masakielastic
  • Thanks for this. I tried your code and it doesn't work for a few emoji graphemes like these: . Any thoughts on how to accurately count those? – Bjorn Roche May 02 '16 at 16:17
  • The compiled regexp should be extracted as `var` outside the functions. – dolmen Jul 26 '16 at 14:00

There are several ways to get a string length:

package main

import (
    "bytes"
    "fmt"
    "strings"
    "unicode/utf8"
)

func main() {
    b := "这是个测试"
    len1 := len([]rune(b))
    len2 := bytes.Count([]byte(b), nil) - 1
    len3 := strings.Count(b, "") - 1
    len4 := utf8.RuneCountInString(b)
    fmt.Println(len1)
    fmt.Println(len2)
    fmt.Println(len3)
    fmt.Println(len4)

}
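For context (my reading of the standard library documentation, not part of the original answer): strings.Count(s, "") returns 1 + the number of runes in s, and bytes.Count(b, nil) does the same for a byte slice, which is why both subtract 1 above:

```go
package main

import (
    "bytes"
    "fmt"
    "strings"
)

func main() {
    s := "这是个测试" // five CJK characters, one rune each

    // Counting the empty separator yields one more than the rune count.
    fmt.Println(strings.Count(s, ""))        // 6
    fmt.Println(bytes.Count([]byte(s), nil)) // 6
    fmt.Println(strings.Count(s, "") - 1)    // 5
}
```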

pigletfly

Depends a lot on your definition of what a "character" is. If "rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
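A sketch of that advice (an illustrative addition, assuming your processing already walks the string once):

```go
package main

import "fmt"

func main() {
    s := "Спутник"

    // Ranging over a string decodes each rune exactly once, so the
    // count comes for free during processing; no second decode pass.
    n := 0
    for _, r := range s {
        _ = r // ... process the rune here ...
        n++
    }
    fmt.Println(n) // 7
}
```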

zzzz
  • When would you *not* see a rune as a character? The Go spec defines a rune as a Unicode codepoint: http://golang.org/ref/spec#Rune_literals. – Thomas Kappler Oct 01 '12 at 08:36
  • Also, to avoid doubling the decode effort, I just do a []rune(str), work on that, then convert back to string when I'm done. I think that's easier than keeping track of code points when traversing a string. – Thomas Kappler Oct 01 '12 at 08:38
    @ThomasKappler: When? Well, when rune is not a character, which it generally isn't. Only some runes are equal to characters, not all of them. Assuming "rune == character" is valid for a subset of Unicode characters only. Example: http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters – zzzz Oct 01 '12 at 08:48
  • @ThomasKappler: but if you look at it that way, then e.g. Java's `String`'s `.length()` method does not return the number of characters either. Neither does Cocoa's `NSString`'s `-length` method. Those simply return the number of UTF-16 entities. But the true number of codepoints is rarely used, because it takes linear time to count it. – newacct Oct 01 '12 at 22:10

I tried to make the counting a bit faster:

en, _ := glyphSmart(data)

func glyphSmart(text string) (int, int) {
    // Note: ranging over a string decodes one rune per iteration, so
    // this counts runes (code points), not normalized characters.
    gc := 0
    dummy := 0
    for ind := range text {
        gc++
        dummy = ind
    }
    dummy = 0
    return gc, dummy
}
Marcelloh