25

Suppose we have a string with some (astral) Unicode characters:

const s = 'Hi  Unicode!'

The [] operator and .charAt() method don't work for getting the 4th character, which should be "":

> s[3]
'�'
> s.charAt(3)
'�'

The .codePointAt() method does get the correct value for the 4th character, but unfortunately it returns a number, which has to be converted back to a string using String.fromCodePoint():

> String.fromCodePoint(s.codePointAt(3))
''

Similarly, converting the string into an array using splats yields valid Unicode characters, so that's another way of getting the 4th one:

> [...s][3]
''

But i can't believe that going from string to number back to string, or having to split the string into an array are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?

> s.simpleMethod(3)
''

Note: i know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode codepoint (no combining characters, no grapheme clusters, etc).

Update: the String.fromCodePoint(str.codePointAt(n)) method is not really viable, since the nth position there doesn't take previous astral symbols into account: String.fromCodePoint(''.codePointAt(1)) // => '�'
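A small sketch of why the index is the problem: codePointAt() takes a code unit index, so it only decodes a full symbol when the index happens to land on the first half of a surrogate pair. The emoji is written as the escape \u{1F4A9} here, which is an assumed stand-in; any code point above U+FFFF should behave the same way.

```javascript
// Assumption: U+1F4A9 stands in for the astral symbol used in the question.
const sample = 'Hi \u{1F4A9} Unicode!';

// Index 3 is the high surrogate, so codePointAt(3) decodes the whole pair.
const atPairStart = String.fromCodePoint(sample.codePointAt(3));

// Index 4 is the low surrogate; codePointAt(4) returns only that code unit.
const atPairMiddle = sample.codePointAt(4).toString(16);

// With the astral symbol first, index 1 already falls inside the pair.
const shifted = '\u{1F4A9}a'.codePointAt(1).toString(16);

console.log(atPairStart === '\u{1F4A9}', atPairMiddle, shifted);
// true dca9 dca9
```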


(I feel kinda dumb asking this; like i'm probably missing something obvious. But previous answers to this question don't work on strings with Unicode symbols on astral planes.)

epidemian
  • 3
    Have you seen this page https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt with some code samples? – ivo Sep 11 '17 at 14:30
  • @ivo no, I hadn't seen that, interesting! The code samples have a "fixed" version of charAt, which is useful, but i was wondering if there was a good method already baked into the language – epidemian Sep 11 '17 at 14:51
  • It's Javascript. Simple things cannot be that simple :) – jorgonor Sep 11 '17 at 14:57
  • If jQ is an option for you, it's built in _there_ https://jsfiddle.net/bq2w3fub/ – rndus2r Sep 11 '17 at 15:09
  • 1
    @rndus2r hmm, i don't see how jQ would help here, jQ's text() returns the string as-is, and does not handle astral characters in any special way it seems: https://jsfiddle.net/epidemian/ha8ydznk/ – epidemian Sep 11 '17 at 16:52

3 Answers

28

The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:

const string = 'Hi  Unicode!';
for (const symbol of string) {
  console.log(symbol);
}

So to get a specific code point based on its index from a string:

const string = 'Hi  Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string]; 
symbols[3]; // ''
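If building the whole array feels wasteful, the same string iterator can be walked only as far as needed. The sketch below uses a hypothetical helper name, codePointCharAt, which is not a built-in, and again assumes U+1F4A9 as the astral symbol.

```javascript
// Hypothetical helper (not part of the language): returns the symbol at the
// given code point index, or undefined when the index is out of range.
function codePointCharAt(str, index) {
  if (!Number.isInteger(index) || index < 0) return undefined;
  let i = 0;
  // for...of uses the string iterator, so each step yields one code point.
  for (const symbol of str) {
    if (i === index) return symbol;
    i += 1;
  }
  return undefined;
}

// Assumption: U+1F4A9 stands in for the astral symbol in the examples above.
const text = 'Hi \u{1F4A9} Unicode!';
console.log(codePointCharAt(text, 3) === '\u{1F4A9}'); // true
console.log(codePointCharAt(text, 4) === ' ');         // true
```

This stops at the requested index instead of splitting the entire string, but it is still a linear scan, since code point positions cannot be computed without reading the preceding code units.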

Still, this would break with grapheme clusters, or with emoji sequences in which several emoji are joined by U+200D ZERO WIDTH JOINER (emoji + U+200D ZERO WIDTH JOINER + emoji + U+200D ZERO WIDTH JOINER + emoji + U+200D ZERO WIDTH JOINER + emoji). Text segmentation helps with that.
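One way to do that segmentation today, assuming the runtime ships Intl.Segmenter (recent browsers and Node versions do, older ones may not), is sketched below. The helper name graphemes and the family emoji sequence are illustrative choices, not taken from the original answer.

```javascript
// Assumption: Intl.Segmenter is available in this runtime.
// Hypothetical helper: splits a string into user-perceived characters.
function graphemes(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(str), (part) => part.segment);
}

// Illustrative sequence: four emoji joined by three U+200D characters.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';

console.log([...family].length);       // 7 code points
console.log(graphemes(family).length); // 1 grapheme cluster
```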

Do you actually need to get the 4th code point in the string, though? What’s your use case?

Mathias Bynens
  • Well, to handle what you characterize as "breaking", and which the OP specifically mentioned he didn't care about, would require specialized logic for individual languages, such as Kannada, which also has complex clusters which can only be composed by quite complex algorithms. –  Sep 11 '17 at 16:44
  • Thanks Mathias! Your article on Unicode is super thorough! Ok, so the array splat method is probably the simplest one then. That's... not too great i guess. Answering your question of actually needing to get the 4th code point: no, my original use case involved getting just the first one. I noticed `str[0]` didn't work for some characters, so i ended up asking myself "wait, how the hell do you get a specific character from a string in JS?", and here we are... – epidemian Sep 11 '17 at 17:04
  • Why is that "emoji sequence" not considered either its own character or a grapheme cluster? – Melab Mar 05 '19 at 02:41
  • @Melab: It is, in terms of [text segmentation](https://mathiasbynens.be/notes/javascript-unicode#other-grapheme-clusters). The problem is that string iteration, `codePointAt`, etc. do not deal with graphemes. – Mathias Bynens Mar 05 '19 at 08:40
  • Use case wise, I'm tokenizing code, and need to create exceptions for illegal characters. The exception message includes the Unicode codepoint of the illegal character. – Carl Smith Sep 25 '20 at 00:08
8

You can use the new u flag to regexp if it's available to you.

const chars = 'Hi  Unicode!'.match(/./ug);
console.log(chars);
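One caveat worth checking with this approach: the dot does not match line terminators unless the s (dotAll) flag is also set, so newlines would silently drop out of the result. A quick sketch, using \u{1F4A9} as an assumed stand-in for the astral symbol:

```javascript
// Assumption: U+1F4A9 stands in for the astral symbol.
const multiline = 'a\n\u{1F4A9}';

const withoutDotAll = multiline.match(/./gu); // the line terminator is skipped
const withDotAll = multiline.match(/./gsu);   // every code point is kept

console.log(withoutDotAll.length, withDotAll.length); // 2 3
```

Note also that match() returns null rather than an empty array when nothing matches, for example on an empty string.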
0

The accepted answer to this question is out of date.

There is now a member of the String object called .at()/1 which does exactly what you're hoping for. If you have shims, shams, a transcompiler like TypeScript or Babel, etc, just set whatever your local configuration is, and you should be good to go.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/at

Amusingly, the spec for this feature, as well as the most common implementation shim (the one that I use), is written by the person who authored the now out-of-date accepted answer here. So even when he's out of date, he's still up to date.

If shimming or transcompiling isn't appropriate for you, there's a library called jsesc that can handle it for you through simple escaping. I'll give you three guesses who wrote the library. First two don't count.

https://www.npmjs.com/package/jsesc

John Haugeland
  • Unless i'm misinterpreting something, this method seems to work the same way as the [] operator, except that it accepts negative indices. `'Hi Unicode!'.at(3)` returns "\ud83d", instead of the expected "" – epidemian Aug 30 '22 at 06:52
  • Yeah, I'm ... I'm pretty surprised by this. "Code unit" is unambiguous and this is the actual purpose of the function. But Firefox agrees? Initially I thought it was because you wrote the emoji in a string literal, and Javascript will actually store two codepoints that way, not a single code unit, but even when you switch it to \u{foo} notation it still does the wrong thing. I do not actually know what's happening here – John Haugeland Aug 31 '22 at 19:25
  • I think what you're referring to is a code point, not a code unit (https://stackoverflow.com/a/27331885/581845). That's why `.codePointAt(3)` will give you the correct code-point, although the index parameter on that function is based on code units, not code points, which is extremely confusing. AFAIK, the purpose of the `.at()` function seems to be to have an equivalent of the `[]` operator that accepts negative indexes, and since the `[]` on strings returns single-code-unit strings, this behavior, though unfortunate, seems consistent with the existing `[]`. – epidemian Sep 13 '22 at 17:33
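
The behavior described in these comments can be checked with a short sketch, assuming a runtime that implements String.prototype.at() and using \u{1F4A9} as a stand-in for the astral symbol:

```javascript
// Assumption: String.prototype.at() exists in this runtime, and U+1F4A9
// stands in for the astral symbol.
const greeting = 'Hi \u{1F4A9} Unicode!';

const viaAt = greeting.at(3);    // indexes code units, like []
const viaBracket = greeting[3];

console.log(viaAt === viaBracket, viaAt === '\uD83D'); // true true
console.log(greeting.at(-1));                          // '!'
```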