10

I'm trying to get the length of a JavaScript string in user-visible graphemes, i.e. ignoring combining characters (and surrogate pairs?). Is this possible, and if so, how would I go about it?

We're using the Dojo Toolkit on our project, but any general JavaScript solution would be great.

Angus
  • 1
    Answers to this question: http://stackoverflow.com/q/3744721/1352254 include the useful info that javascript uses UCS-2 instead of UTF-16, and indicate that this won't be possible. – Angus Apr 24 '12 at 14:54
  • It will be possible, it just won't be easy because you'll have to deal with some low-level Unicode issues. – hippietrail Jan 28 '14 at 05:11
  • Duplicate of https://stackoverflow.com/questions/24531751/how-can-i-split-a-string-containing-emoji-into-an-array – Rúnar Berg Jul 08 '22 at 19:10
  • 1
    Does this answer your question? [How can I split a string containing emoji into an array?](https://stackoverflow.com/questions/24531751/how-can-i-split-a-string-containing-emoji-into-an-array) – Stefnotch Mar 20 '23 at 11:58

5 Answers

8

Here is a pure JavaScript library that does just that:

https://github.com/orling/grapheme-splitter

It implements the Unicode UAX #29 standard, including the edge cases you're likely to miss in a home-brew solution: non-Latin diacritics, Hangul (Korean) jamo characters, emoji, multiple combining marks, etc.

Orlin Georgiev
2

Split the string into an array with the spread syntax, then count the elements:

let arr = [..."⛔"] // ["", "", "", "⛔", "", "", ""]
let len = arr.length // number of code points, not UTF-16 code units

Credit to downGoat

Note that this solution won't work in some special cases, such as the one mentioned in the comments below, where one smiley is composed of four: [..."‍‍‍"] -> ['', '‍', '', '‍', '', '‍', '']

I'm still posting it here for people arriving from Google, because it works in most cases and is much easier than the other alternatives.
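To make the behaviour above concrete, here is a small sketch written with escape sequences so the code points are unambiguous. The sample strings and the name countCodePoints are my own picks for illustration, not taken from the answer. Spreading iterates by code point, so a surrogate pair counts once, but an emoji built with zero width joiners still comes apart.

```javascript
// Spread iterates a string by code point, not by UTF-16 code unit.
const countCodePoints = (str) => [...str].length;

// One emoji outside the BMP, stored as a surrogate pair.
const grin = "\u{1F600}";
// Family emoji: four people joined by three zero width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

console.log(grin.length, countCodePoints(grin));     // 2 1
console.log(family.length, countCodePoints(family)); // 11 7
```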

Full solution

To handle special emoji like the one above, you can search for the joining character and merge the pieces around it. Its char code is 8205 (U+200D, ZERO WIDTH JOINER). Here is how to do it:

let myStr = "‍‍‍"
let arr = [...myStr]

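for (let i = arr.length - 2; i >= 1; i--) { // walk backwards; a joiner needs a neighbor on each side
    if (arr[i].charCodeAt(0) === 8205) { // find & handle the zero width joiner (U+200D)
        arr[i-1] += arr[i] + arr[i+1]; // glue the joiner and the right neighbor onto the left element
        arr.splice(i, 2) // drop the two elements that were just merged
    }
}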
console.log(arr.length) //2

Haven't found a case where this doesn't work. Comment if you do
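For reference, the same loop as a self-contained sketch with escape sequences. The function name countWithZwjMerge and the test strings are hypothetical, chosen only for illustration. It also shows one case the joiner merge does not cover as far as I can tell: a base letter followed by a combining mark is still counted as two.

```javascript
// Hypothetical helper: split by code point, then re-join the pieces around U+200D.
function countWithZwjMerge(str) {
    const arr = [...str];
    // Walk backwards so splicing never shifts an index we still have to visit.
    for (let i = arr.length - 2; i >= 1; i--) {
        if (arr[i] === "\u200D") {
            arr[i - 1] += arr[i] + arr[i + 1];
            arr.splice(i, 2);
        }
    }
    return arr.length;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(countWithZwjMerge(family + "\u{1F600}")); // 2
console.log(countWithZwjMerge("e\u0301"));            // 2, the combining accent is not merged
```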

lior bakalo
2

Use Intl.Segmenter.

The Intl.Segmenter object enables locale-sensitive text segmentation, enabling you to get meaningful items (graphemes, words or sentences) from a string.

[...new Intl.Segmenter().segment('️‍⚧️️‍‍❤️‍')].length;
//=> 3

"️‍⚧️️‍‍❤️‍".length
//=> 24

[..."️‍⚧️️‍‍❤️‍"].length
//=> 17

As of March 2023 Intl.Segmenter is available in Node, Chrome and Safari, but not in Firefox (see availability table, polyfill available here).
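A minimal sketch of a counting helper built on this API, assuming a runtime where Intl.Segmenter exists. The name countGraphemes and the sample strings are mine; the granularity option is spelled out even though "grapheme" is the default.

```javascript
// Hypothetical helper name; Intl.Segmenter itself is the standard API.
function countGraphemes(str, locale) {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    let count = 0;
    // segment() returns an iterable of segment objects; only the count is needed here.
    for (const _segment of segmenter.segment(str)) count++;
    return count;
}

console.log(countGraphemes("abc"));      // 3
console.log(countGraphemes("e\u0301")); // 1 (letter plus combining acute accent)
console.log(countGraphemes("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}")); // 1 (family emoji)
```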

Rúnar Berg
1

For the combining characters, look at the Derived Combining Class file, which lists all combining characters (among others). Since you're only interested in counting, you could just strip them out, which leaves you with a slightly closer estimate.
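A rough sketch of that idea, assuming an engine with Unicode property escapes (ES2018). Note the substitution: instead of a table built from the Derived Combining Class file, it uses the regex class \p{M} (the Mark general category), which is a similar but not identical set. The function name is hypothetical.

```javascript
// Remove every code point in the Mark category (Mn, Mc, Me), then count code points.
function countWithoutMarks(str) {
    return [...str.replace(/\p{M}/gu, "")].length;
}

console.log(countWithoutMarks("e\u0301"));        // 1
console.log(countWithoutMarks("a\u0308\u0323b")); // 2 (two marks on the "a" are dropped)
```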

In the post linked to by Angus, "JavaScript strings outside of the BMP", there is code that deals with surrogates. But that code does the contrary of what you want -- it splits the 0x10000+ code points into their two surrogate halves. For counting, you want each pair treated as a single code point, and since you're only counting them, not displaying them, that is all you need.
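As an illustration of counting a surrogate pair once, here is a sketch that walks the UTF-16 code units by hand. This is not the code from the linked post; the function name and sample string are my own.

```javascript
// Count code points by treating a high surrogate followed by a low surrogate as one unit.
function countCodePoints(str) {
    let count = 0;
    for (let i = 0; i < str.length; i++) {
        const unit = str.charCodeAt(i);
        const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < str.length) {
            const next = str.charCodeAt(i + 1);
            if (next >= 0xDC00 && next <= 0xDFFF) i++; // skip the low surrogate
        }
        count++;
    }
    return count;
}

console.log("a\u{1F600}b".length);           // 4 (code units)
console.log(countCodePoints("a\u{1F600}b")); // 3 (code points)
```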

BUT there's another category of code points you might want to deal with too: the non-printable characters. Anything under 0x20 of course, but there are plenty of others -- look at the 0x2000 range for instance. These are not visible either and should not be included in your count.
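One possible way to drop such characters before counting, again assuming Unicode property escapes are available. Treating the Cc (control) and Cf (format) categories as "non-printable" is my assumption, and the function name is hypothetical. The spaces at U+2000 to U+200A are in category Zs and are left in place by this sketch, while U+200D (the emoji joiner) is Cf and gets removed.

```javascript
// Strip control (Cc) and format (Cf) code points, then count what is left.
function countVisibleCodePoints(str) {
    return [...str.replace(/[\p{Cc}\p{Cf}]/gu, "")].length;
}

// U+0007 is a control character, U+200B is ZERO WIDTH SPACE (category Cf).
console.log(countVisibleCodePoints("a\u0007b\u200Bc")); // 3
```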

Community
dda
  • Thanks for the info, I didn't notice at the time that the linked question had example code, I had looked over it and thought JS just couldn't handle the low-level string stuff that would be necessary. – Angus Sep 06 '12 at 21:19
0

This open-source CoffeeScript implementation seems to work decently enough: https://github.com/devongovett/grapheme-breaker (if only it weren't CoffeeScript)

TooTallNate