23

I am trying to count "characters" in Go. That is, if a string contains one printable "glyph" or "composed character" (what someone would ordinarily think of as a character), I want it to count as 1. For example, the string "Hello, 世界" should count as 11, since a human would look at it and say there are 11 glyphs.

utf8.RuneCountInString() works well in most cases, including ASCII, accents, Asian characters and even emojis. However, as I understand it, runes correspond to code points, not characters. Basic emojis work, but emojis with a skin-tone modifier give the wrong count: https://play.golang.org/p/aFIGsB6MsO
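To make the rune/character distinction concrete, here is a minimal sketch using only the standard library. The sample string (a hand emoji followed by a dark skin-tone modifier, written with escapes) and the helper name `codePoints` are illustrative assumptions; they are not taken from the playground link.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints returns the code points of s in U+XXXX notation.
// (Hypothetical helper, for illustration only.)
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("%U", r))
	}
	return out
}

func main() {
	// U+1F596 (hand) followed by U+1F3FF (dark skin-tone modifier):
	// rendered as one glyph, but stored as two code points.
	s := "\U0001F596\U0001F3FF"
	fmt.Println(codePoints(s))             // [U+1F596 U+1F3FF]
	fmt.Println(utf8.RuneCountInString(s)) // 2
}
```

The modifier is a separate code point, so any function that counts runes will report 2 for what renders as a single glyph.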

From what I read here and here the following should work, but I still don't seem to be getting the right results (it over-counts):

func CountCharactersInString(str string) int {
    var ia norm.Iter
    ia.InitString(norm.NFC, str)
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    return nc
}

This doesn't work either:

func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}
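
A likely reason the pattern above falls short, sketched with the standard library only: it groups a base code point with following code points in the Mark category (`\pM`), but skin-tone modifiers belong to the modifier-symbol category rather than Mark, so they start a new match. The sample strings and the helper name `regexCount` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as in the question: a non-mark followed by any number of marks.
var clusterRE = regexp.MustCompile(`\PM\pM*|.`)

// regexCount counts matches of the pattern in s.
// (Hypothetical helper, for illustration only.)
func regexCount(s string) int {
	return len(clusterRE.FindAllString(s, -1))
}

func main() {
	// 'e' + U+0301 (combining acute accent, a Mark): grouped together.
	fmt.Println(regexCount("e\u0301")) // 1
	// Hand + U+1F3FF (skin-tone modifier, not a Mark): counted separately.
	fmt.Println(regexCount("\U0001F596\U0001F3FF")) // 2
}
```

So the regex handles combining accents but not emoji modifiers, joiner sequences, or flag pairs.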

I am looking for something similar to this in Objective-C:

+ (NSInteger)countCharactersInString:(NSString *) string {
    // --- Calculate the number of characters entered by the user and update the character count label
    NSInteger count = 0;
    NSUInteger index = 0;
    while (index < string.length) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        count++;
        index += range.length;
    }
    return count;
}
Community
Bjorn Roche
  • You're looking for an implementation of the ["Grapheme Cluster Boundary" algorithm from UAX #29](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries). – 一二三 Apr 30 '16 at 00:37
  • I believe that's right. I tried both grapheme-counting implementations from this answer, http://stackoverflow.com/a/26728555/547291, and ran into the same trouble. Perhaps grapheme cluster boundary counting is closer to what I want? – Bjorn Roche May 02 '16 at 16:22
  • The answers to that question confuse "grapheme clusters" with "character normalisation" (all have serious errors in them). – 一二三 May 03 '16 at 00:47
  • Were you able to find a solution to this? The problem is the skin-tone modifier is being counted as a separate character and norm does not "count" it as 1 character with the hand. – F21 Jan 25 '18 at 04:16
  • Never found a correct solution, so I had to loosen my requirements. – Bjorn Roche Jan 25 '18 at 15:01

5 Answers

15

The straightforward, native approach is to use utf8.RuneCountInString(), which counts runes (code points):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    fmt.Println("counts =", utf8.RuneCountInString(str))
}
mvndaai
0xFK
13

I wrote a package that allows you to do this: https://github.com/rivo/uniseg. It breaks strings according to the rules specified in Unicode Standard Annex #29 which is what you are looking for. Here is how you would use it in your case:

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    fmt.Println(uniseg.GraphemeClusterCount("Hello, 世界"))
}

This will print 11 as you expect.
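
For readers curious what such segmentation roughly involves, here is a simplified standard-library sketch. It is not the uniseg implementation and not a full UAX #29 implementation: it only attaches combining marks, variation selectors, zero width joiner sequences and skin-tone modifiers to the preceding rune, and pairs up regional indicators. Hangul syllables, CRLF, prepend characters and the emoji-only restriction on joiner sequences are ignored. All names and the sample string (written with escapes) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

const zwj = '\u200D' // zero width joiner

// isExtender reports whether r attaches to the preceding rune in this
// simplified model: combining marks (which include variation selectors),
// the zero width joiner, and skin-tone modifiers U+1F3FB..U+1F3FF.
func isExtender(r rune) bool {
	return unicode.IsMark(r) || r == zwj || (r >= 0x1F3FB && r <= 0x1F3FF)
}

// isRegionalIndicator reports whether r is a flag letter U+1F1E6..U+1F1FF.
func isRegionalIndicator(r rune) bool {
	return r >= 0x1F1E6 && r <= 0x1F1FF
}

// approxGraphemeCount is a hypothetical helper that approximates the
// number of user-perceived characters in s. It is a sketch, not UAX #29.
func approxGraphemeCount(s string) int {
	count := 0
	afterZWJ := false // previous rune was a zero width joiner
	riRun := 0        // length of the current run of regional indicators
	for _, r := range s {
		switch {
		case isExtender(r):
			if count == 0 {
				count = 1 // a leading extender forms its own cluster
			}
			afterZWJ = r == zwj
			riRun = 0
		case afterZWJ:
			// Simplification: anything after a joiner stays in the cluster.
			afterZWJ = false
			riRun = 0
		case isRegionalIndicator(r):
			riRun++
			if riRun%2 == 1 {
				count++ // every odd indicator starts a new flag
			}
		default:
			count++
			riRun = 0
		}
	}
	return count
}

func main() {
	// Assumed sample: text plus a hand with a skin-tone modifier and a plain hand.
	fmt.Println(approxGraphemeCount("Hello, 世\U0001F596\U0001F3FF\U0001F596界")) // 11
	fmt.Println(approxGraphemeCount("\U0001F1E6\U0001F1FD"))                    // 1 (flag pair)
}
```

For anything beyond experimentation, a library that implements the complete rule set is the safer choice.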

Oliver
  • Best solution. All the other solutions result in counting some emojis as 1 character and other emojis as 2 characters. – darkstar May 31 '22 at 23:56
  • 1
    There is a difference between bytes, runes, and graphemes, and it seems many people confuse the three. (In most use cases, it doesn't matter anyway.) For example, 🏳️‍🌈 (rainbow flag emoji) is 1 grapheme, 4 runes, and 14 bytes. The Go stdlib only has built-in functions for bytes and runes but not for graphemes. – Oliver Jun 02 '22 at 06:20
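
The byte and rune figures in the comment above can be checked with the standard library. This is a small sketch; the helper name `sizes` is an assumption, and the flag is spelled out with escapes so the individual code points are visible.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes returns the byte length and the rune count of s.
// (Hypothetical helper, for illustration only.)
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	// Rainbow flag: U+1F3F3 (white flag), U+FE0F (variation selector-16),
	// U+200D (zero width joiner), U+1F308 (rainbow).
	flag := "\U0001F3F3\uFE0F\u200D\U0001F308"
	b, r := sizes(flag)
	fmt.Println(b, r) // 14 4
}
```

Neither number is the grapheme count (1); the standard library offers no function for that.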
11

Have you tried strings.Count?

package main

import (
    "fmt"
    "strings"
)

func main() {
    fmt.Println(strings.Count("Hello, 世界", "")) // Returns 2
}
p_mcp
  • In the example "Hello, 世界", I would want it to count 11, since there are 11 characters, not 2. I will edit my question to clarify. – Bjorn Roche Apr 29 '16 at 14:03
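
For reference, a short sketch of how strings.Count behaves with an empty separator: per the standard library documentation it returns 1 plus the number of Unicode code points, so it is a rune-based count as well. The sample string and the helper name `countWithEmptySep` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// countWithEmptySep counts occurrences of the empty string in s.
// (Hypothetical helper, for illustration only.)
func countWithEmptySep(s string) int {
	return strings.Count(s, "")
}

func main() {
	s := "世界"
	fmt.Println(countWithEmptySep(s))      // 3 (number of runes + 1)
	fmt.Println(utf8.RuneCountInString(s)) // 2
}
```

Subtracting 1 gives the rune count, which still differs from the grapheme count for composed emoji.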
4

See the example in the API documentation: https://golang.org/pkg/unicode/utf8/#example_DecodeLastRuneInString

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "Hello, 世界"
    count := 0
    for len(str) > 0 {
        r, size := utf8.DecodeLastRuneInString(str)
        count++
        fmt.Printf("%c %v\n", r, size)

        str = str[:len(str)-size]
    }
    fmt.Println("count:", count)
}
Jiang YD
  • 2
    That counts *runes*, not *graphemes*: `str := "🇦🇽"` counts 2 instead of 1. – 一二三 Apr 29 '16 at 06:42
  • What is "AX", and why should it be 1? – Jiang YD Apr 29 '16 at 06:47
  • 1
    It's `U+1F1E6 U+1F1FD`, which should render as the flag of the Åland Islands. Any other regional indicator symbol will have the same result (perhaps `` renders better on your system?). – 一二三 Apr 29 '16 at 07:40
  • but `U+1F1E6` and `U+1F1FD` can be two separate characters too, am I right? – Jiang YD Apr 29 '16 at 08:28
  • 1
    Yes, but in a regional indicator sequence they form one grapheme (or "one printable 'glyph'" as the original question put it). – 一二三 Apr 29 '16 at 08:42
  • Apparently there is a 'unicode/norm' package to normalize unicode grapheme, is that what's needed here : https://blog.golang.org/normalization ? – phtrivier Apr 29 '16 at 09:01
  • How could we think of a colorful flag picture as a "glyph" or a "character"? I also found that the Objective-C function rangeOfComposedCharacterSequenceAtIndex that @Bjorn Roche used behaves differently on different systems (http://stackoverflow.com/questions/32831455/different-results-of-rangeofcomposedcharactersequenceatindex-in-playground-and-i). I'm totally confused by the complexity of emoji! – Jiang YD Apr 29 '16 at 09:25
  • @phtrivier, yes, the examples I gave in my question use the unicode/norm package, but I still get the wrong answer sometimes, such as for the glyph. – Bjorn Roche Apr 29 '16 at 14:09
  • there is a standard function - utf8.RuneCountInString – feech Nov 11 '20 at 02:28
-2

I think the easiest way to do this would be like this:

package main

import "fmt"

func main() {
    str := "Hello, 世界"
    var counter int
    for range str {
        counter++
    }
    fmt.Println(counter)
}

This one prints 11
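
Note that ranging over a string decodes it one rune at a time, so this loop should give the same result as utf8.RuneCountInString. A minimal sketch, with an assumed sample string and a hypothetical helper name `rangeCount`:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rangeCount counts the iterations of a range loop over s,
// which is one iteration per decoded rune.
// (Hypothetical helper, for illustration only.)
func rangeCount(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	// 'e' followed by U+0301 (combining acute accent): one glyph, two runes.
	s := "e\u0301"
	fmt.Println(rangeCount(s), utf8.RuneCountInString(s), len([]rune(s))) // 2 2 2
}
```

All three approaches count code points, so composed characters are still over-counted.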