I'm working with emojis and I run into situations where an emoji gets split into several parts (because an emoji's length is greater than 1), and I end up with � symbol(s).

How can I get the real (string) value behind it?


If I understand correctly, the � symbol is a generic "broken" symbol whose underlying value can differ depending on the situation. For example, the following hard-coded comparison doesn't work because, while myVar logs as the � symbol, its underlying value/string is different:

if (myVar === "�") // ...
Solo

1 Answer

A character in a string can be replaced with its corresponding Unicode escape sequence, e.g.,

"A" === "\u0041"

However, JavaScript strings are sequences of 16-bit code units, so any character above the 0x0000 - 0xFFFF range (e.g. most emojis) has to be encoded as two units, a "surrogate pair", e.g.,

"🌯" === "\uD83C\uDF2F"
//         ^     ^
//         A     B
//
// A: first half
// B: second half

Which is why "🌯".length === 2!
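The arithmetic behind the two halves can be sketched roughly as follows. Note that `toSurrogatePair` is a hypothetical helper written for illustration, not a built-in; in practice the engine does this encoding for you.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its
// UTF-16 surrogate pair (high half first, low half second).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("only code points in 0x10000-0x10FFFF use a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F32F);
console.log(high.toString(16), low.toString(16));            // d83c df2f
console.log(String.fromCharCode(high, low) === "\u{1F32F}"); // true
```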

Put together, these halves print a 🌯 on screen. However, if you split them apart they become these "broken" symbols:

"🌯".split("")
//=>  ["�", "�"]
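If the goal is to avoid tearing the pair apart in the first place, one option is to split by code point instead of by code unit. This is a minimal sketch; the variable names are only illustrative.

```javascript
const burrito = "\uD83C\uDF2F"; // U+1F32F, written as its surrogate pair

// split("") cuts between UTF-16 code units, so the pair is torn apart.
const byCodeUnit = burrito.split("");

// The string iterator (used by Array.from and spread) walks code points,
// so the two halves stay together.
const byCodePoint = Array.from(burrito);

console.log(byCodeUnit.length);          // 2
console.log(byCodePoint.length);         // 1
console.log(byCodePoint[0] === burrito); // true
```

Emojis built from several code points (for example those joined with a zero-width joiner) are still separated by this approach, so it only covers the surrogate-pair case discussed here.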

As you worked out, the � symbol is the same face for many different values. To know what's behind the mask you can simply use String#charCodeAt, e.g.,

"🌯".split("").map(c => c.charCodeAt(0))
//=> [55356, 57135]

Or for their hex values:

"🌯".split("").map(c => c.charCodeAt(0).toString(16))
//=> ["d83c", "df2f"]
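Building on charCodeAt, a small helper can report what each unit of a string really is. `describeCodeUnits` is a hypothetical name used for illustration; the ranges it checks are the standard UTF-16 surrogate ranges.

```javascript
// Hypothetical helper: list every UTF-16 code unit with its hex value
// and whether it is an ordinary unit or one half of a surrogate pair.
function describeCodeUnits(str) {
  return Array.from({ length: str.length }, (_, i) => {
    const code = str.charCodeAt(i);
    let kind = "ordinary";
    if (code >= 0xD800 && code <= 0xDBFF) kind = "high surrogate";
    else if (code >= 0xDC00 && code <= 0xDFFF) kind = "low surrogate";
    return { index: i, hex: code.toString(16), kind };
  });
}

// "Hi " followed by only the first half of the pair \uD83C\uDF2F
console.log(describeCodeUnits("Hi \uD83C"));
```

The last entry in the logged array has hex "d83c" and kind "high surrogate", which is the real value hiding behind the � symbol.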

How do you detect a broken emoji?

"Hello 🌯".slice(0, -1)
//=> "Hello �"

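We can use a regular expression with the u flag and a Unicode property escape to match a surrogate. With the u flag a well-formed pair is read as a single code point, so only an unpaired half matches: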

const broken_emoji = /\p{Surrogate}/u;

broken_emoji.test("Hello 🌯");
//=> false

broken_emoji.test("Hello 🌯".slice(0, -1));
//=> true
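Once a broken half can be detected, the same pattern can be used to clean it up, or the cut can be made by code point so nothing breaks to begin with. This is only a sketch: `stripBrokenEmoji` and `dropLastCodePoints` are hypothetical helper names, not part of any library.

```javascript
// Same property escape as above, with the g flag so every match is replaced.
const loneSurrogate = /\p{Surrogate}/gu;

// Hypothetical helper: drop any surrogate that has lost its partner.
function stripBrokenEmoji(str) {
  return str.replace(loneSurrogate, "");
}

// Hypothetical helper: remove the last `count` code points (not code units).
function dropLastCodePoints(str, count) {
  const points = Array.from(str);
  return points.slice(0, Math.max(0, points.length - count)).join("");
}

const text = "Hello \uD83C\uDF2F";
console.log(stripBrokenEmoji(text.slice(0, -1)) === "Hello "); // true
console.log(stripBrokenEmoji(text) === text);                   // true, intact pair untouched
console.log(dropLastCodePoints(text, 1) === "Hello ");          // true
```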

customcommander