Counting characters in golang string

Question

I am trying to count "characters" in Go. That is, if a string contains one printable "glyph" or "composed character" (what someone would ordinarily think of as a character), I want it to count as 1. For example, the string "Hello, 世界" should count 11, since there are 11 characters, and a human would look at this and say there are 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ASCII, accented letters, Asian characters and even basic emojis. However, as I understand it, runes correspond to code points, not characters. Basic emojis are counted correctly, but emojis with a skin-tone modifier give the wrong count: https://play.golang.org/p/aFIGsB6MsO
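To make the mismatch concrete, here is a minimal sketch using only the standard library. The helper name runeAndByteCount is made up for illustration, and the emoji is written with escapes (U+1F44B, waving hand, followed by U+1F3FD, a skin-tone modifier) so that it survives copy and paste.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAndByteCount is a hypothetical helper: it returns the number of
// code points (runes) and the number of bytes in s.
func runeAndByteCount(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	plain := "\U0001F44B"           // waving hand
	toned := "\U0001F44B\U0001F3FD" // waving hand + skin-tone modifier

	r, b := runeAndByteCount(plain)
	fmt.Println(r, b) // 1 4

	r, b = runeAndByteCount(toned)
	fmt.Println(r, b) // 2 8
}
```

The second string renders as a single hand, yet it is two code points, which is why a rune count reports 2.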

From what I read here and here the following should work, but I still don't seem to be getting the right results (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}
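A small sketch of why the regular expression falls short, assuming the same pattern as above (the names markClusterRE and markClusterCount are invented here). The pattern glues combining marks (Unicode category M) onto the preceding character, but a skin-tone modifier is a modifier symbol rather than a mark, so it is matched as a separate item.

```go
package main

import (
	"fmt"
	"regexp"
)

// markClusterRE is the pattern from the question: one non-mark followed by
// any number of marks, or else any single character.
var markClusterRE = regexp.MustCompile(`\PM\pM*|.`)

// markClusterCount counts the matches of markClusterRE in s.
func markClusterCount(s string) int {
	return len(markClusterRE.FindAllString(s, -1))
}

func main() {
	// "e" + U+0301 (combining acute accent): the mark joins the letter.
	fmt.Println(markClusterCount("e\u0301")) // 1

	// Waving hand + skin-tone modifier: the modifier is not in category M.
	fmt.Println(markClusterCount("\U0001F44B\U0001F3FD")) // 2
}
```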

I am looking for something similar to this in Objective C:

+ (NSInteger)countCharactersInString:(NSString *) string {
    // --- Calculate the number of characters entered by the user and update the character count label
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}

You're looking for an implementation of the ["Grapheme Cluster Boundary" algorithm from UAX #29](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). — 一二三, Apr 30 '16 at 00:37
I believe that's right. I tried both implementations for grapheme counting from this answer http://stackoverflow.com/a/26728555/547291, but I run into the same trouble, but perhaps grapheme cluster boundary counting is more what I want? — Bjorn Roche, May 02 '16 at 16:22
The answers to that question confuse "grapheme clusters" with "character normalisation" (all have serious errors in them). — 一二三, May 03 '16 at 00:47
Were you able to find a solution to this? The problem is the skin-tone modifier is being counted as a separate character and norm does not "count" it as 1 character with the hand. — F21, Jan 25 '18 at 04:16
Never found a correct solution, so I had to loosen my requirements. — Bjorn Roche, Jan 25 '18 at 15:01

score 15 · Answer 1 · edited Dec 01 '20 at 17:24

The straightforward way is to use the standard library's utf8.RuneCountInString(), which counts runes (code points):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    fmt.Println("counts =", utf8.RuneCountInString(str))
}

edited Dec 01 '20 at 17:24 by mvndaai · answered Oct 31 '20 at 11:32 by 0xFK

or even more straight with utf8.RuneCountInString – feech Nov 11 '20 at 02:29
Thanks for the modification, @mvndaai. RuneCountInString is like RuneCount, but its input is a string instead of a byte slice. – 0xFK Dec 02 '20 at 07:16
This is the best answer because it uses the built-in utf8 package instead of an external library. – Serhii Polishchuk Feb 28 '21 at 00:18
Go doesn't need a package to understand unicode. Just make sure you count runes and not bytes; `len([]rune("Hello, 世界"))`. – Ferdy Pruis Jun 24 '21 at 09:04
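The rune-counting variants mentioned in this answer and its comments all give the same number. The sketch below (the function name runeCounts is invented for illustration) runs them side by side on a string containing a combining accent, where the rune count is already larger than what a reader would call the number of characters.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeCounts counts the runes in s in three equivalent ways.
func runeCounts(s string) (viaUTF8, viaConversion, viaRange int) {
	viaUTF8 = utf8.RuneCountInString(s) // no allocation
	viaConversion = len([]rune(s))      // allocates a rune slice
	for range s {                       // ranging over a string yields runes
		viaRange++
	}
	return
}

func main() {
	s := "cafe\u0301" // "café" written as "e" + combining acute accent

	a, b, c := runeCounts(s)
	fmt.Println(len(s), a, b, c) // 6 5 5 5
}
```

A reader sees four characters, the string has five runes and six bytes, so none of these variants counts graphemes.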

score 13 · Accepted Answer · Oliver · 2019-03-13T20:50:21.417

I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29 which is what you are looking for. Here is how you would use it in your case:

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("Hello, 世界"))
}

This will print 11 as you expect.

edited Mar 13 '19 at 20:50 · answered Mar 13 '19 at 17:54 by Oliver

Best solution. All the other solutions result in counting some emojis as 1 character and other emojis as 2 characters. – darkstar May 31 '22 at 23:56
There is a difference between bytes, runes, and graphemes, and it seems many people confuse the three. (In most use cases, it doesn't matter anyway.) For example, 🏳️‍🌈 (rainbow flag emoji) is 1 grapheme, 4 runes, and 14 bytes. The Go stdlib only has built-in functions for bytes and runes but not for graphemes. – Oliver Jun 02 '22 at 06:20

score 11 · Answer 3 · answered Apr 29 '16 at 13:42

Have you tried strings.Count?

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.Count("Hello, 世界", "")) // Returns 2
}

answered Apr 29 '16 at 13:42 by p_mcp

In the example "Hello, 世界", I would want it to count 11, since there are 11 characters, not 2. I will edit my question to clarify. – Bjorn Roche Apr 29 '16 at 14:03

score 4 · Answer 4 · answered Apr 29 '16 at 02:23

See the example in the API documentation: https://golang.org/pkg/unicode/utf8/#example_DecodeLastRuneInString

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    count := 0
    for len(str) > 0 {
        r, size := utf8.DecodeLastRuneInString(str)
        count++
        fmt.Printf("%c %v\n", r, size)

        str = str[:len(str)-size]
    }
    fmt.Println("count:", count)
}

answered Apr 29 '16 at 02:23 by Jiang YD

That counts *runes*, not *graphemes*: `str := "🇦🇽"` counts 2 instead of 1. – 一二三 Apr 29 '16 at 06:42
What is "AX", and why should it be 1? – Jiang YD Apr 29 '16 at 06:47
It's `U+1F1E6 U+1F1FD`, which should render as the flag of the Åland Islands. Any other regional indicator symbol will have the same result (perhaps `` renders better on your system?). – 一二三 Apr 29 '16 at 07:40
but `U+1F1E6` and `U+1F1FD` can be two separate characters too, am I right? – Jiang YD Apr 29 '16 at 08:28
Yes, but in a regional indicator sequence they form one grapheme (or "one printable 'glyph'" as the original question put it). – 一二三 Apr 29 '16 at 08:42
Apparently there is a 'unicode/norm' package to normalize unicode grapheme, is that what's needed here : https://blog.golang.org/normalization ? – phtrivier Apr 29 '16 at 09:01
How can a colorful flag picture be considered a "glyph" or a "character"? I also found that the Objective-C function rangeOfComposedCharacterSequenceAtIndex that @Bjorn Roche used behaves differently on different systems (http://stackoverflow.com/questions/32831455/different-results-of-rangeofcomposedcharactersequenceatindex-in-playground-and-i). I'm totally confused by the complexity of emoji! – Jiang YD Apr 29 '16 at 09:25
@phtrivier, yes, the examples I gave in my question use the unicode/norm package, but I still get the wrong answer sometimes, such as for the glyph. – Bjorn Roche Apr 29 '16 at 14:09
there is a standard function - utf8.RuneCountInString – feech Nov 11 '20 at 02:28
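To illustrate what the comments above mean by grouping runes into graphemes, here is a deliberately simplified sketch that uses only the standard library. It is not an implementation of UAX #29: it only attaches combining marks, zero width joiners, and skin-tone modifiers to the preceding rune, joins whatever follows a zero width joiner, and pairs up regional indicators. The function names are invented for illustration, and many cases (Hangul syllables, CR LF, prepend characters, and so on) are ignored. For real use, a full implementation such as the uniseg package from the accepted answer is the safer choice.

```go
package main

import (
	"fmt"
	"unicode"
)

// isRegionalIndicator reports whether r is a regional indicator symbol
// (U+1F1E6..U+1F1FF); two of them in a row form one flag.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// extends reports whether r attaches to the preceding rune in this
// simplified model.
func extends(r rune) bool {
	return unicode.IsMark(r) || // combining marks, incl. variation selectors
		r == 0x200D || // zero width joiner
		(r >= 0x1F3FB && r <= 0x1F3FF) // emoji skin-tone modifiers
}

// approxGraphemeCount is a rough approximation of a grapheme count.
func approxGraphemeCount(s string) int {
	count := 0
	afterZWJ := false // previous rune was a zero width joiner
	riRun := 0        // length of the current run of regional indicators
	for _, r := range s {
		switch {
		case extends(r):
			if count == 0 {
				count = 1 // nothing to attach to, count it on its own
			}
			riRun = 0
		case afterZWJ:
			riRun = 0 // joined to the previous cluster
		case isRegionalIndicator(r):
			riRun++
			if riRun%2 == 1 {
				count++ // first half of a flag starts a new cluster
			}
		default:
			count++
			riRun = 0
		}
		afterZWJ = r == 0x200D
	}
	return count
}

func main() {
	fmt.Println(approxGraphemeCount("e\u0301"))              // 1
	fmt.Println(approxGraphemeCount("\U0001F44B\U0001F3FD")) // 1
	fmt.Println(approxGraphemeCount("\U0001F1E6\U0001F1FD")) // 1
}
```

The last call is the U+1F1E6 U+1F1FD pair discussed in the comments: two runes, one flag.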

score -2 · Answer 5 · answered Oct 01 '20 at 18:14

I think the easiest way to do this would be like this:

package main

import "fmt"

func main() {
    str := "Hello, 世界"
    var counter int
    for range str {
        counter++
    }
    fmt.Println(counter)
}

This one prints 11

answered Oct 01 '20 at 18:14 by Nice Developer
