21

For a long time, we used a naive approach to split strings into characters in JS:

someString.split('');

But the popularity of emoji forced us to change this approach: emoji (and other non-BMP characters) are made of two "characters" (UTF-16 code units).

String.fromCodePoint(128514).split(''); // array of 2 characters; can't embed due to StackOverflow limitations
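
A minimal sketch of what that split produces, printing the code units as hex numbers rather than as lone surrogates (which is what could not be embedded). The variable names are only illustrative:

```javascript
// Code point 128514 (U+1F602) lies outside the BMP,
// so UTF-16 stores it as a surrogate pair of two code units.
const emoji = String.fromCodePoint(128514);
const units = emoji.split('');

console.log(emoji.length); // 2
console.log(units.length); // 2

// Show each code unit as hex instead of printing the lone surrogates.
const hexUnits = units.map((unit) => unit.charCodeAt(0).toString(16));
console.log(hexUnits); // [ 'd83d', 'de02' ]
```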

So what is the modern, correct, and performant approach to this task?

Ginden
  • I'm curious. Which StackOverflow limitations are you talking about? – Mr Lister Feb 05 '16 at 11:44
  • It seems I couldn't post the question with the result of the `JSON.stringify(String.fromCodePoint(128514).split(''))` expression: it caused a "Malformed URI" error thrown from jQuery, which prevented me from posting. – Ginden Feb 05 '16 at 11:48
  • @MrLister: [I have added a Meta post](http://meta.stackexchange.com/questions/274191/cant-post-result-alone-surrogates-because-of-jquery-raising-malformed-uri-bug). – Ginden Feb 05 '16 at 11:56
  • 1
    see https://mathiasbynens.be/notes/javascript-unicode for the big picture – georg Feb 05 '16 at 11:57

5 Answers

22

Using spread syntax in an array literal:

const str = "";
console.log([...str]);

Using for...of:

function split(str) {
  const arr = [];
  // for...of iterates the string by code point, not by UTF-16 code unit
  for (const char of str) {
    arr.push(char);
  }
  return arr;
}

const str = "";
console.log(split(str));
Omkar76
  • 6
    I think it should be noted that this unfortunately still doesn't cover a lot of emojis currently in use. E.g. `[...'‍♀️']` becomes `["", "", "‍", "♀", "️"]`. Which means no e.g. straightforward string reversal or symbol-wise comparison is possible. – AndyO Aug 26 '21 at 10:50
  • 1
    See https://github.com/orling/grapheme-splitter as an example library, mind the open issues regarding zero-width-joiners. Maybe there's a newer library out there. – Manuel Nov 06 '21 at 17:50
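
The limitation raised in the comments can be reproduced with escape sequences. This is only a sketch: the emoji from the comment did not survive here, so the sequence below (person shrugging + skin tone modifier + ZWJ + female sign + variation selector) is an assumed stand-in that also has five code points.

```javascript
// Assumed stand-in: one user-perceived emoji built from five code points.
const shrug = '\u{1F937}\u{1F3FB}\u200D\u2640\uFE0F';

// Spread uses the string iterator, which yields code points, not graphemes.
const pieces = [...shrug];
console.log(pieces.length); // 5

const hexPieces = pieces.map((p) => p.codePointAt(0).toString(16));
console.log(hexPieces); // [ '1f937', '1f3fb', '200d', '2640', 'fe0f' ]
```

Reversing or comparing `pieces` element by element would therefore operate on fragments of the emoji rather than on the emoji itself.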
10

The best approach to this task is to use the native String.prototype[Symbol.iterator], which iterates by Unicode code point rather than by UTF-16 code unit. Consequently, a clean and easy way to split a string into Unicode characters is to call Array.from on the string, e.g.:

const string = String.fromCodePoint(128514, 32, 105, 32, 102, 101, 101, 108, 32, 128514, 32, 97, 109, 97, 122, 105, 110, 128514);
Array.from(string);
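
As a rough illustration of the difference, the sketch below compares the UTF-16 length with the number of elements Array.from returns, and uses the optional mapping argument of Array.from to get numeric code points. The sample string and variable names are assumptions made for this example:

```javascript
// Hypothetical sample: one non-BMP emoji followed by " hi".
const sample = String.fromCodePoint(128514, 32, 104, 105);

const chars = Array.from(sample);
console.log(sample.length); // 5 (the emoji takes two code units)
console.log(chars.length);  // 4 (one element per code point)

// The second argument maps each element while the array is built.
const codePoints = Array.from(sample, (ch) => ch.codePointAt(0));
console.log(codePoints); // [ 128514, 32, 104, 105 ]
```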
Ginden
6

JavaScript has a newer API called Intl.Segmenter (part of the ECMAScript Internationalization API, ECMA-402) that lets you split strings into graphemes (the user-perceived characters of a string). With this API, your split might look like this:

const split = (str) => {
  const itr = new Intl.Segmenter("en", {granularity: 'grapheme'}).segment(str);
  return Array.from(itr, ({segment}) => segment);
}
// See browser console for output
console.log(split('')); // ['']
console.log(split('é')); // ['é']
console.log(split('‍‍')); // ['‍‍']
console.log(split('❤️')); // ['❤️']
console.log(split('‍♀️')); // ['‍♀️']

This lets you handle not only emoji made of two UTF-16 code units (a single code point outside the BMP), but also composite characters (e.g. é written as a base letter plus a combining accent), sequences joined by ZWJs, characters with variation selectors (e.g. ❤️), and emoji with modifiers. None of these can be handled by invoking the string iterator (spread ..., for...of, Symbol.iterator, etc.) as in the other answers, because that only iterates the code points of your string.
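
Since Intl.Segmenter is not available in every runtime, one possible variant adds feature detection. This is a sketch, not a definitive implementation: the name `splitGraphemes` and the code-point fallback are assumptions of this example, and the fallback does not keep graphemes together.

```javascript
function splitGraphemes(str, locale = 'en') {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    // Each item is an object with `segment`, `index` and `input` properties.
    return Array.from(segmenter.segment(str), ({ segment }) => segment);
  }
  // Assumed fallback: split by code point only (graphemes may be broken up).
  return Array.from(str);
}

// "e" followed by a combining acute accent: two code points, one grapheme.
const accented = 'e\u0301';
// Man + ZWJ + woman + ZWJ + girl: five code points, one grapheme.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

console.log(splitGraphemes(accented).length); // 1 when Intl.Segmenter is available
console.log(splitGraphemes(family).length);   // 1 when Intl.Segmenter is available
```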

Nick Parsons
5

A flag was introduced in ECMAScript 2015 to make regular expressions Unicode-aware.

Adding the u flag makes . match a whole code point, so each match is a complete character rather than half of a surrogate pair.

const withFlag = `ABDE`.match(/./ug);
const withoutFlag = `ABDE`.match(/./g);

console.log(withFlag, withoutFlag);

There's a little more about it here
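
A small sketch of the effect, written with escape sequences so it does not depend on how the page renders non-BMP characters. The sample strings are assumptions of this example. It also shows one caveat: `.` does not match line terminators unless the `s` (dotAll) flag from ES2018 is added.

```javascript
// U+1D49E (a mathematical script letter) is outside the BMP.
const text = 'A\u{1D49E}B';

const matchedWithU = text.match(/./gu);
const matchedWithoutU = text.match(/./g);
console.log(matchedWithU.length);    // 3 (the astral character stays whole)
console.log(matchedWithoutU.length); // 4 (the surrogate pair is split in two)

// Caveat: without the s flag, `.` skips the newline entirely.
const multiline = 'a\n\u{1D49E}';
console.log(multiline.match(/./gu).length);  // 2
console.log(multiline.match(/./gsu).length); // 3
```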

robstarbuck
0

I did something like this in a project where I had to support older browsers and an ES5 minifier; it will probably be useful to others:

    if (Array.from && window.Symbol && window.Symbol.iterator) {
        array = Array.from(input[window.Symbol.iterator]());
    } else {
        array = ...; // maybe `input.split('');` as fallback if it doesn't matter
    }
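
If the fallback does need to keep surrogate pairs together, one option is a regular expression that only uses ES5 syntax. This is a sketch under that assumption, not the code used in the answer; the names `splitByRegex` and `splitCodePoints` are hypothetical, and `typeof Symbol` is checked instead of `window.Symbol` so the snippet also runs outside a browser.

```javascript
// Matches a surrogate pair first, otherwise any single code unit.
var CODE_POINT_RE = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;

function splitByRegex(input) {
  // match() returns null for an empty string, so default to an empty array.
  return input.match(CODE_POINT_RE) || [];
}

function splitCodePoints(input) {
  if (typeof Array.from === 'function' &&
      typeof Symbol === 'function' && Symbol.iterator) {
    return Array.from(input);
  }
  return splitByRegex(input);
}

var parts = splitByRegex('\uD83D\uDE02 ok');
console.log(parts.length); // 4
```

Like the string iterator, this splits by code point only, so grapheme clusters such as ZWJ sequences are still broken apart.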
Ebrahim Byagowi