A character in a string can be written as its corresponding Unicode escape sequence, e.g.,
"A" === "\u0041"
However, any character outside the 0x0000 - 0xFFFF range (e.g. emojis) needs to be broken down into a "surrogate pair" of two 16-bit code units, e.g.,
"🌯" === "\uD83C\uDF2F"
// A: \uD83C, the first half (high surrogate)
// B: \uDF2F, the second half (low surrogate)
Which is why "🌯".length === 2!
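The arithmetic behind the two halves can be sketched as follows. This is only an illustration: toSurrogatePair is a hypothetical helper name, while String#codePointAt and String.fromCharCode are standard.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its two UTF-16 halves.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits go into the first half
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits go into the second half
  return [high, low];
}

const burrito = "\uD83C\uDF2F";
const codePoint = burrito.codePointAt(0); // reads the whole pair as one code point
const pair = toSurrogatePair(codePoint);

console.log(codePoint.toString(16));                   // "1f32f"
console.log(pair.map(n => n.toString(16)));            // ["d83c", "df2f"]
console.log(String.fromCharCode(...pair) === burrito); // true
console.log([...burrito].length);                      // 1, spread iterates by code point
```

The length of 2 counts UTF-16 code units, whereas spreading the string counts code points, which is why the two numbers differ.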
Put together, these halves print a 🌯 on screen. However, if you split them apart they become these "broken" symbols:
"🌯".split("")
//=> ["�", "�"]
As you worked out, the � symbol is the same face for many different values. To see what's behind the mask, you can use String#charCodeAt, e.g.,
"🌯".split("").map(c => c.charCodeAt(0))
//=> [55356, 57135]
Or for their hex values:
"🌯".split("").map(c => c.charCodeAt(0).toString(16))
//=> ["d83c", "df2f"]
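For comparison, here is a small sketch of splitting by code point instead of by code unit, which keeps the pair together. The variable names are only illustrative; Array.from and String#codePointAt are standard.

```javascript
const text = "Hi \uD83C\uDF2F"; // "Hi " followed by the emoji from above

// split("") cuts between UTF-16 code units, so the pair is torn apart.
const units = text.split("");

// Array.from walks the string by code point, so the pair stays together.
const points = Array.from(text);

console.log(units.length);                 // 5
console.log(points.length);                // 4
console.log(points[3] === "\uD83C\uDF2F"); // true
console.log(points.map(c => c.codePointAt(0).toString(16)));
// ["48", "69", "20", "1f32f"]
```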
How do you detect a broken emoji?
"Hello 🌯".slice(0, -1)
//=> "Hello �"
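We can use a regular expression with the u flag and a Unicode property escape to match a surrogate. With the u flag the string is matched by code point, so an intact pair does not count as a surrogate; only an unpaired half does:
const broken_emoji = /\p{Surrogate}/u;
broken_emoji.test("Hello 🌯");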
//=> false
broken_emoji.test("Hello 🌯".slice(0, -1));
//=> true
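Detection aside, one way to avoid producing a broken emoji in the first place is to truncate by code point. A minimal sketch, assuming a hypothetical helper named dropLast (it is not a built-in):

```javascript
// Same pattern as above; the name is only illustrative.
const loneSurrogate = /\p{Surrogate}/u;

// Hypothetical helper: drop the last `count` characters by code point,
// so a surrogate pair is never cut in half.
function dropLast(str, count) {
  if (count <= 0) return str; // slice(0, -0) would return an empty array
  return Array.from(str).slice(0, -count).join("");
}

const text = "Hello \uD83C\uDF2F";

console.log(loneSurrogate.test(text.slice(0, -1))); // true, the pair was cut
console.log(loneSurrogate.test(dropLast(text, 1))); // false
console.log(dropLast(text, 1) === "Hello ");        // true
```

Note that this only protects surrogate pairs. Emojis built from several code points (for example with joiners or skin-tone modifiers) can still be split this way. Newer runtimes also provide String#isWellFormed for the detection step, where available.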