I ran into an issue with counting Unicode characters. I need to count combined Unicode characters, where each combination counts as one.

Take this character for example:

द्ध

If you use the `.length` property on this string, it gives you 3, which is technically correct, as it is a combination of द (U+0926), ् (U+094D, the virama) and ध (U+0927).

However, put द्ध in a text area and you realize, by using the arrow keys, that it is treated as one character. Only when you use backspace do you realize that there are 3 characters.

Edit: Also for your test case please consider that it could be a word. It could be something like,

द्धद्द

I would expect a count of 2 here, but `.length` gives 6.

This is a problem when you want to get or set the current caret position in input elements.

pewpewlasers
  • So, you want the number of logical characters instead of UTF-16 codeunits? – Deduplicator Aug 13 '14 at 17:53
  • Interesting question. What's more is that in Python, `len("द्ध")` is 9 and `len("द")` is 3. – Charles Clayton Aug 13 '14 at 17:55
  • @Deduplicator yeah something like that. I am working on something and need to get the caret position on a textarea. But since the length is not properly read, I am stumped at accurately getting the position. – pewpewlasers Aug 13 '14 at 17:59
  • In Chrome, if I try to put the caret after this character, I actually need to count it with a length of 3: http://jsfiddle.net/c68rwwut/ . So I guess this problem isn't actually a problem. – Volune Aug 13 '14 at 18:04
  • Googling for "javascript grapheme count" gives good results, like this: https://mathiasbynens.be/notes/javascript-unicode – Deduplicator Aug 13 '14 at 18:19
  • @Volune actually I need to replace a word in a textarea with these characters. After I do so, I want to keep the caret where the user had it before, but that position needs to be shifted slightly. To estimate the shift I need to know how long these Unicode characters were. – pewpewlasers Aug 13 '14 at 18:21
  • Instead of saying character, where there are multiple incompatible definitions corresponding to different levels of abstraction, consider using terms uniquely referring to one such definition, like codeunit, codepoint, grapheme and grapheme-cluster. – Deduplicator Aug 13 '14 at 18:22
  • @pewpewlasers: What do you need caret position for, apart from remembering it and setting it back? Have you tried using raw UTF-16 unit counts as returned by `.length`? – Karol S Aug 13 '14 at 19:04
  • Question should be reopened, or at least linked elsewhere. This is about counting graphemes and the "duplicate" is about byte-length, totally unrelated. – Coderer Feb 11 '21 at 15:28
  • I've changed the duplicate to point to something more suitable than [the bytes question](https://stackoverflow.com/q/5515869/102441) – Eric May 19 '22 at 20:39

1 Answer

Your example “द्ध” is a string of three Unicode characters, and the length property correctly indicates this.

What you apparently want to count is “characters” in some other sense, something like “what a speaker of a language intuitively sees as one character”. This is a vague and mutable concept. Unicode Standard Annex #29 (UAX #29), Unicode Text Segmentation, tries to analyze the concept, calling it a “grapheme cluster”, and describes some algorithms for working with it.
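
Newer JavaScript engines than the ones available when this was written ship `Intl.Segmenter`, which segments text into grapheme clusters along the lines of UAX #29. A minimal sketch, assuming an engine that provides it; the helper name `countGraphemes` is made up for illustration:

```javascript
// Hypothetical helper: counts grapheme clusters as reported by the engine.
// Assumes Intl.Segmenter is available (recent browsers, recent Node.js).
function countGraphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster.
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("e\u0301")); // "e" + combining acute accent
console.log(countGraphemes("\u0926\u094D\u0927")); // द्ध
```

Whether द्ध is reported as one cluster or two depends on the Unicode data the engine was built with, since the rules for virama conjuncts have been revised between Unicode versions. It is worth checking the result on the engines you target rather than assuming it.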

Unfortunately, when this answer was written, JavaScript had no built-in tools for recognizing whether a character is, e.g., a combining mark and thus should be regarded as part of a cluster. However, if you can limit yourself to handling just one writing system, you can probably code the operations manually, referring to the relevant Unicode characters by their code numbers.
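
As a rough sketch of the manual, single-script approach for Devanagari: treat every combining mark as part of the preceding cluster, and treat a letter that directly follows the virama (U+094D) as part of the same conjunct. The helper name `countDevanagariClusters` is hypothetical, the rule is a simplification rather than the UAX #29 algorithm, and it assumes regex Unicode property escapes (`\p{M}`, `\p{L}`) are supported:

```javascript
const VIRAMA = "\u094D"; // DEVANAGARI SIGN VIRAMA

// Hypothetical helper: simplified conjunct-aware count for Devanagari text.
function countDevanagariClusters(text) {
  let count = 0;
  let afterVirama = false;
  // for...of iterates by code point, not by UTF-16 code unit.
  for (const ch of text) {
    const isMark = /\p{M}/u.test(ch);   // vowel signs, virama, nukta, ...
    const isLetter = /\p{L}/u.test(ch); // consonants and independent vowels
    // A mark attaches to what precedes it; a letter right after a virama
    // is treated as continuing the conjunct.
    const joinsPrevious = isMark || (afterVirama && isLetter);
    if (!joinsPrevious) {
      count += 1;
    }
    afterVirama = ch === VIRAMA;
  }
  return count;
}

console.log(countDevanagariClusters("\u0926\u094D\u0927")); // द्ध → 1
console.log(countDevanagariClusters("\u0926\u094D\u0927\u0926\u094D\u0926")); // द्धद्द → 2
```

This ignores zero-width joiners and non-joiners and has not been checked against how any particular editor moves the caret, so treat the counts as an approximation.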

Moreover, if the intent is to make the count match the way some input editor works (e.g. how the arrow keys move over characters), you would need to know the logic of that editor. It may implement Unicode grapheme clusters in some sense, or something else.
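
If the goal is only to restore the caret after replacing a word in a textarea, it may be enough to stay in UTF-16 code units throughout, because `selectionStart` and `setSelectionRange` use the same units as `.length`. A minimal sketch on plain strings; the helper `replaceRangeKeepingCaret` and its parameters are made up for illustration:

```javascript
// Hypothetical helper: replaces text[start, end) and shifts the caret.
// All offsets are UTF-16 code unit offsets, as used by .length.
function replaceRangeKeepingCaret(text, start, end, replacement, caret) {
  const newText = text.slice(0, start) + replacement + text.slice(end);
  const delta = replacement.length - (end - start);
  let newCaret;
  if (caret <= start) {
    newCaret = caret; // caret was before the replaced word
  } else if (caret >= end) {
    newCaret = caret + delta; // caret was after it: shift by the size change
  } else {
    newCaret = start + replacement.length; // caret was inside: move to its end
  }
  return { text: newText, caret: newCaret };
}

const result = replaceRangeKeepingCaret("abc xyz", 0, 3, "\u0926\u094D\u0927", 7);
console.log(result.text.length, result.caret);
```

In a page this would be followed by something like `textarea.value = result.text; textarea.setSelectionRange(result.caret, result.caret);`. No grapheme counting is involved, so it sidesteps the question of how the editor groups characters.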

Jukka K. Korpela
  • For anybody finding this in the far-flung future: there's bad news, good news, and more bad news. Bad: still no built-in way to count graphemes. Good: [there's a package](https://www.npmjs.com/package/grapheme-splitter) that implements the linked UAX. Bad: it's over 200k, even if you just want to count the number of graphemes in a single string. – Coderer Feb 11 '21 at 15:30
  • The package does not count exactly as I expect; for example, it doesn't count the heart correctly. – user3856437 Aug 01 '23 at 04:21