Old question, but I should warn: do NOT use .split('').
You'll get weird results with non-BMP (non-Basic-Multilingual-Plane) characters.
The reason is that methods like .split('') and .charCodeAt() work on UTF-16 code units, so they only handle characters with a code point below 65536 correctly; higher code points are represented by a pair of (lower-valued) "surrogate" pseudo-characters.
'𝟘𝟙𝟚'.length // -> 6
'𝟘𝟙𝟚'.split('') // -> ["�", "�", "�", "�", "�", "�"]
'😎'.length // -> 2
'😎'.split('') // -> ["�", "�"]
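To see where those numbers come from, here is a small sketch that inspects the UTF-16 code units of one non-BMP character. The character U+1F60E is an arbitrary pick for illustration, written as escapes so it survives copy-paste; the variable names are mine.

```javascript
// U+1F60E (chosen only for illustration) lies outside the BMP,
// so it is stored as two UTF-16 code units.
var s = '\uD83D\uDE0E';

// Collect each code unit as hex.
var units = [];
for (var i = 0; i < s.length; i++) {
  units.push(s.charCodeAt(i).toString(16));
}
// units -> ["d83d", "de0e"]: a high surrogate followed by a low surrogate

// Recombine the pair by hand to get the real code point.
var hi = s.charCodeAt(0);
var lo = s.charCodeAt(1);
var codePoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
// codePoint.toString(16) -> "1f60e"
```

.split('') cuts between those two units, which is why each half shows up as "�".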
Use ES2015 (ES6) features where possible:
Using the spread operator:
let arr = [...str];
Or Array.from
let arr = Array.from(str);
Or split with the new u RegExp flag:
let arr = str.split(/(?!$)/u);
Examples:
[...'𝟘𝟙𝟚'] // -> ["𝟘", "𝟙", "𝟚"]
[...'😎😜🙃'] // -> ["😎", "😜", "🙃"]
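All three work for the same reason: the string iterator and the u flag step through code points instead of UTF-16 code units. Here is a quick sketch to check that they agree. The sample string and the toCodePoints name are my own, picked only for illustration.

```javascript
// Sample: "a", U+1F60E, "b" (characters chosen only for illustration).
const sample = 'a\u{1F60E}b';

const viaSpread = [...sample];
const viaFrom = Array.from(sample);
const viaSplit = sample.split(/(?!$)/u);

// Each array has 3 elements, while sample.length is 4 (code units).
console.log(viaSpread.length, viaFrom.length, viaSplit.length, sample.length);

// for...of uses the same iterator, so listing code points is easy.
// toCodePoints is a hypothetical helper name.
const toCodePoints = (str) => {
  const out = [];
  for (const ch of str) {
    out.push(ch.codePointAt(0));
  }
  return out;
};

console.log(toCodePoints(sample)); // [97, 128526, 98]
```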
For ES5, options are limited.
I came up with this function, which internally uses an MDN example to get the correct code point of each character:
function stringToArray(str) {
    var i = 0,
        arr = [],
        codePoint;
    // knownCharCodeAt() returns NaN once i runs past the last character
    while (!isNaN(codePoint = knownCharCodeAt(str, i))) {
        arr.push(String.fromCodePoint(codePoint));
        i++;
    }
    return arr;
}
This requires the knownCharCodeAt() function and, for some browsers, a String.fromCodePoint() polyfill:
if (!String.fromCodePoint) {
    // ES6 Unicode Shims 0.1, © 2012 Steven Levithan, MIT License
    String.fromCodePoint = function fromCodePoint () {
        var chars = [], point, offset, units, i;
        for (i = 0; i < arguments.length; ++i) {
            point = arguments[i];
            offset = point - 0x10000;
            units = point > 0xFFFF ? [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)] : [point];
            chars.push(String.fromCharCode.apply(null, units));
        }
        return chars.join("");
    };
}
Examples:
stringToArray('𝟘𝟙𝟚') // -> ["𝟘", "𝟙", "𝟚"]
stringToArray('😎😜🙃') // -> ["😎", "😜", "🙃"]
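If you'd rather not pull in the helper and the polyfill, the same idea can be sketched in plain ES5 by walking the code units and keeping surrogate pairs together. This is an alternative of my own, not the function above; splitByCodePoint is a hypothetical name, and the test string is an arbitrary example.

```javascript
// Hypothetical helper; uses only ES5 syntax and APIs.
function splitByCodePoint(str) {
  var result = [], i = 0, code, next;
  while (i < str.length) {
    code = str.charCodeAt(i);
    // High surrogate followed by a low surrogate: keep both units together.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length) {
      next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        result.push(str.charAt(i) + str.charAt(i + 1));
        i += 2;
        continue;
      }
    }
    // BMP character or unpaired surrogate: keep the single unit as is.
    result.push(str.charAt(i));
    i += 1;
  }
  return result;
}

// "a", U+1F60E, "b" (chosen only for illustration)
var parts = splitByCodePoint('a\uD83D\uDE0Eb');
// parts.length -> 3, and parts[1] is the full two-unit character
```

Unpaired surrogates are passed through untouched, which matches what the ES2015 string iterator does with them.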
Note: str[index] (ES5) and str.charAt(index) will also return weird results with non-BMP characters, e.g. '😎'.charAt(0) returns "�".
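If you need indexed access, one option is to index into the code point array instead of the string. A minimal sketch, assuming ES2015 is available; charAtCodePoint is a hypothetical name and the sample string is arbitrary.

```javascript
// Hypothetical helper: returns the character at a code point index,
// or '' when the index is out of range (mirroring charAt()).
function charAtCodePoint(str, index) {
  var chars = Array.from(str); // ES2015; splits by code point
  return index >= 0 && index < chars.length ? chars[index] : '';
}

var text = '\uD83D\uDE0Ex'; // U+1F60E followed by "x"

var first = charAtCodePoint(text, 0);  // the full two-unit character
var second = charAtCodePoint(text, 1); // "x"
var broken = text.charAt(1);           // lone low surrogate "\uDE0E"
```

This rebuilds the array on every call, so for repeated lookups it is cheaper to split once and reuse the array.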
UPDATE: Read this nice article about JS and Unicode.