
Normally I would just use something like str[i].

But what if `str = "☀️🙌🏼"`?

`str[i]` fails. `for (x of str) console.log(x)` also fails: it prints a total of 4 characters, even though there are clearly only 2 emoji in the string.

What's the best way to iterate over every character I can see in a string (and newlines, I guess), and nothing else?

The ideal solution would return an array of 2 characters: the 2 emoji, and nothing else. The claimed duplicate, and a bunch of other solutions I've found, don't meet this criterion.

hippietrail
thedayturns
  • I think you should check this blog post: [link](https://mathiasbynens.be/notes/javascript-unicode) – msencer Apr 22 '16 at 04:49
  • Possible duplicate of [Split JavaScript string into array of codepoints? (taking into account "surrogate pairs" but not "grapheme clusters")](http://stackoverflow.com/questions/21397316/split-javascript-string-into-array-of-codepoints-taking-into-account-surrogat) – Raymond Chen Apr 22 '16 at 04:59
  • Are you saying you want to capture the emoji, or skip over it and find the next "normal" character? – KevBot Apr 22 '16 at 05:00
  • @RaymondChen your suggested answer appears to be a polyfill for the `for...of` syntax which I pointed out does not work in this case. But please correct me if I'm wrong! – thedayturns Apr 22 '16 at 09:32
  • @KevBot I would like to capture the emoji as a single character. Essentially **if I can select it as a single character, I'd like to capture it as a single character.** – thedayturns Apr 22 '16 at 09:33
  • The suggested answer says "`for..of` cannot be polyfilled." The suggested answers shows how to split a string into code points. If you don't want to polyfill it, then just use it as a free function. – Raymond Chen Apr 22 '16 at 14:23
  • @RaymondChen My desired answer should **only be 2 characters in length** (both emojis and nothing else). The `toCodePoints` function returns an array of length 4. – thedayturns Apr 22 '16 at 19:57
  • First of all, your original statement is incorrect. `for (x in str) console.log(x)` prints six characters (plus additional junk not relevant to the discussion), not the four you originally claimed. That's because the string `"☀️🙌🏼"` is six code units long: `"\u2600\ufe0f\ud83d\ude4c\ud83c\udffc"`. This breaks down into four code points: U+2600 (BLACK SUN WITH RAYS), U+FE0F (VARIATION SELECTOR-16), U+1F64C (PERSON RAISING BOTH HANDS IN CELEBRATION), and U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3). It sounds like you are looking to break into graphemes, which is a harder problem. – Raymond Chen Apr 22 '16 at 22:38
  • @RaymondChen I said `for (x of str)`, not `x in str`, specifically because `of` breaks on code points rather than code units. Graphemes turned out to be the magic word here though: once I googled for that I quickly found a decent library to get the job done. – thedayturns Apr 22 '16 at 23:40
  • See my solution posted under a different question that doesn't take Astral characters/Surrogate pairs into account: https://stackoverflow.com/questions/1966476/javascript-process-each-letter-of-text/36392879#36392879 – hippietrail Jul 05 '17 at 09:23
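
To make the code unit / code point breakdown from the comments concrete, here is a minimal sketch. It assumes the question's string is the one spelled out in escapes in the comment above; the variable names are only illustrative.

```javascript
// The string from the question, written with escapes so each code point is visible:
// U+2600, U+FE0F, U+1F64C, U+1F3FC.
const str = "\u2600\uFE0F\u{1F64C}\u{1F3FC}";

// UTF-16 code units: what str.length and str[i] operate on.
const codeUnits = str.split("");

// Code points: what for...of, spread, and Array.from iterate over.
const codePoints = Array.from(str);

console.log(codeUnits.length);  // 6
console.log(codePoints.length); // 4

// Show each code point in U+XXXX notation.
const labels = codePoints.map(
  (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase()
);
console.log(labels.join(" ")); // U+2600 U+FE0F U+1F64C U+1F3FC
```

Neither count is the 2 the question asks for, which is why grapheme segmentation is needed on top of code point iteration.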

2 Answers


I eventually found the answer in the form of this insane JS library:

https://github.com/orling/grapheme-splitter
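
For readers who would rather not add a dependency, a similar split can probably be sketched with the built-in `Intl.Segmenter`. This is not the grapheme-splitter library's API, just a swapped-in alternative. It assumes a runtime that ships `Intl.Segmenter` (current browsers and recent Node.js releases), and `toGraphemes` is a hypothetical helper name.

```javascript
// Assumption: the runtime implements Intl.Segmenter.
// toGraphemes is a hypothetical helper, not part of any library.
function toGraphemes(text) {
  // "grapheme" granularity asks for user-perceived characters.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } records.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// The question's string: sun + variation selector, raised hands + skin tone modifier.
const emoji = "\u2600\uFE0F\u{1F64C}\u{1F3FC}";
const graphemes = toGraphemes(emoji);

console.log(graphemes.length); // 2
console.log(toGraphemes("a\nb").length); // 3 (the newline is its own segment)
```

The exact results depend on the Unicode data bundled with the runtime, so very new emoji sequences may split differently on older engines.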

thedayturns

You need to make your own methods for astral characters.

"foobar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);
// => ["f", "o", "o", "", "b", "a", "r"]
Amadan
  • This does not work in all cases. Consider `"foob☀️ar".match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|./g);`. – thedayturns Apr 22 '16 at 23:31
  • @thedayturns: Yeah, I only covered astral characters, which is where JavaScript "mistakenly" splits a single Unicode character into two JS characters. The emptyish string there is a VARIATION SELECTOR 16 (U+FE0F), which is a separate Unicode character, but combines with the previous; a similar issue would be all the combining characters like COMBINING ACUTE ACCENT (U+0301). So to solve *that* problem, you would need a whole library, which is outside the scope of a StackOverflow answer. – Amadan Apr 23 '16 at 13:35
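
As a rough illustration of the combining-character point in the last comment, the regex can be extended so that each base code point absorbs the marks that follow it. This is only an approximation under stated assumptions, not a full grapheme algorithm: it relies on Unicode property escapes (`\p{M}`, which needs the `u` flag), and `baseWithMarks` is a hypothetical name.

```javascript
// Approximation only: one non-mark code point plus any following marks,
// or a run of marks with no base in front of it.
const baseWithMarks = /\P{M}\p{M}*|\p{M}+/gu;

// "foob" + SUN + VARIATION SELECTOR-16 + "ar", as in the comment above.
const sample = "foob\u2600\uFE0Far";
const sampleParts = sample.match(baseWithMarks);
console.log(sampleParts.length); // 7: the selector stays attached to the sun

// "e" + COMBINING ACUTE ACCENT is kept together as well.
const accented = "e\u0301";
console.log(accented.match(baseWithMarks).length); // 1

// Limitation: the skin tone modifier U+1F3FC is a symbol, not a mark,
// so this pattern still splits the second emoji from the question in two.
const modified = "\u{1F64C}\u{1F3FC}";
console.log(modified.match(baseWithMarks).length); // 2
```

The last case shows why a dedicated grapheme segmenter is still the safer choice for emoji.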