How to get � symbol real value?

Question

I'm working with emojis and I run into situations where an emoji gets split into several parts (because emojis have a length greater than 1), and I end up with � symbol(s).

How do I get its real (string) value?

If I understand it correctly, the � symbol is a generic "broken" symbol that can have a different value depending on the situation. E.g., the following hard-coded comparison doesn't work because, while myVar logs as the � symbol, the underlying value/string is different:

if (myVar === "�") // ...

See the answer in this question: https://stackoverflow.com/questions/2670037/how-to-remove-invalid-utf-8-characters-from-a-javascript-string — Diodeus - James MacFarlane, Apr 03 '19 at 15:27
Why are you splitting the characters and how are you doing it ? — jo_va, Apr 03 '19 at 16:28
@Solo Because I use sublime text and emojis are actually graphically rendered by the editor as a single character — GrafiCode, Apr 03 '19 at 17:48

customcommander · Accepted Answer · 2020-09-29T00:12:21.767

A character in a string can be replaced with its corresponding Unicode escape sequence, e.g.,

"A" === "\u0041"

However, any character outside the 0x0000 - 0xFFFF range (e.g. emojis) has to be represented as a "surrogate pair", e.g.,

"🌯" === "\uD83C\uDF2F"
//         ^     ^
//         A     B
//
// A: first half
// B: second half

Which is why "🌯".length === 2!
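To make the relationship between the two halves and the actual character concrete, here is a minimal sketch of the standard UTF-16 arithmetic. The helper name toCodePoint is made up for illustration; String.fromCodePoint and String#codePointAt are the built-ins that do the same job.

```javascript
// Combine a high (first) and low (second) surrogate into one code point.
// `toCodePoint` is a hypothetical helper name, not a built-in.
function toCodePoint(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const cp = toCodePoint(0xD83C, 0xDF2F);

console.log(cp.toString(16));                      // "1f32f"
console.log(String.fromCodePoint(cp) === "\uD83C\uDF2F"); // true
console.log("\uD83C\uDF2F".codePointAt(0) === cp); // true
console.log("\uD83C\uDF2F".length);                // 2 (UTF-16 code units)
console.log([..."\uD83C\uDF2F"].length);           // 1 (code points)
```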

Put together, these halves print a 🌯 on screen. However, if you split them apart, they become these "broken" symbols:

"🌯".split("")
//=>  ["�", "�"]
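If the goal is to avoid creating the broken halves in the first place, one option is to split by code point instead of by code unit. This is only a sketch: the sample string is made up, and it assumes each emoji is a single code point (emojis built from several code points, such as ZWJ sequences, would need something like Intl.Segmenter instead).

```javascript
const text = "Hi \uD83C\uDF2F!"; // "\uD83C\uDF2F" is the surrogate pair shown earlier

// split("") cuts between UTF-16 code units, so the pair is torn apart.
const units = text.split("");
console.log(units.length); // 6

// The string iterator walks code points, so the pair stays together.
const points = Array.from(text);
console.log(points.length);                // 5
console.log(points[3] === "\uD83C\uDF2F"); // true
```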

As you worked out, the � symbol is the same face for many different values. To know what's behind the mask you can simply use String#charCodeAt, e.g.,

"🌯".split("").map(c => c.charCodeAt(0))
//=> [55356, 57135]

Or for their hex values:

"🌯".split("").map(c => c.charCodeAt(0).toString(16))
//=> ["d83c", "df2f"]
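Building on the hex values, here is a rough sketch that prints the "real" value as an escape sequence and shows why the hard-coded comparison in the question fails. toEscapes is a hypothetical helper name, and I'm assuming the � pasted into the source is the replacement character U+FFFD.

```javascript
// Render every UTF-16 code unit of a string as a \uXXXX escape.
// `toEscapes` is a hypothetical helper name used for illustration.
function toEscapes(str) {
  return str
    .split("")
    .map(c => "\\u" + c.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0"))
    .join("");
}

const half = "\uD83C\uDF2F".slice(0, 1); // keep only the first half

console.log(toEscapes(half));   // \uD83C
console.log(half === "\uFFFD"); // false: U+FFFD is a different character
console.log(half === "\uD83C"); // true: compare against the real value
```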

How do you detect a broken emoji?

"Hello 🌯".slice(0, -1)
//=> "Hello �"

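We can use a regular expression with the u flag and a Unicode property escape to match a surrogate. With the u flag a well-formed pair is treated as a single code point, so only a lone (unpaired) surrogate matches: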

const broken_emoji = /\p{Surrogate}/u;

broken_emoji.test("Hello 🌯");
//=> false

broken_emoji.test("Hello 🌯".slice(0, -1));
//=> true
//=> true
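If you also need to know where the broken halves are (or can't rely on property escapes in your runtime), a manual scan over the code units works too. This is a sketch, not a library API: findLoneSurrogates is a made-up name, and the ranges are the standard UTF-16 surrogate ranges.

```javascript
// Return the indexes of unpaired surrogates in `str`.
// `findLoneSurrogates` is a hypothetical helper, not a built-in.
function findLoneSurrogates(str) {
  const found = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip its second half
      } else {
        found.push(i); // high surrogate with nothing valid after it
      }
    } else if (isLow) {
      found.push(i); // low surrogate with no high surrogate before it
    }
  }
  return found;
}

const whole = "Hello \uD83C\uDF2F";
console.log(findLoneSurrogates(whole));              // []
console.log(findLoneSurrogates(whole.slice(0, -1))); // [6]
console.log(findLoneSurrogates(whole.slice(7)));     // [0]
```

Recent runtimes also ship String#isWellFormed, which answers the yes/no question directly if it is available to you.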

Reading further

https://mathiasbynens.be/notes/javascript-unicode
