
Suppose we have a UTF-8 string (represented by a string of hex bytes in character form) that might include an emoji, or any other Unicode characters. How do we represent the string as a literal in JavaScript for use in the alert function? In PHP, there are two easy ways: (1) "\xE2\x96\xB6" (2) hex2bin('E296B6'). I'm having trouble doing the same thing in pure JavaScript. '\xE2\x96\xB6' doesn't seem to work (it displays a paragraph mark instead of a right solid triangle in an alert function).

I thought of writing a 'hex2bin' function to return the argument as a hex byte string, but JavaScript has no such datatype. In PHP, strings can contain any bit patterns, but I don't think this is true for JavaScript.

I know that JavaScript is a modern language that supports Unicode, so there must be an easy way to do this.

Note that any answer that talks about the \u construct is wrong, since \u does not accept a UTF-8 string. UTF-8 is currently the standard and recommended for most storage of character strings, yet most programming languages do not yet offer simple literal syntax for UTF-8 byte strings.

When programmers talk about low-level representations for Unicode, they are frequently interested in UTF-8, since it is the standard and an efficient encoding. UTF-16 and Unicode code points (and the many odd encodings) are of interest, particularly for naming characters (U+HHHH notation) and for representing them in fixed widths. But it is UTF-8 that is the standard, and we need more answers on Stack Overflow to help us with UTF-8.

David Spector
  • Does this answer your question? [Insert Unicode character into JavaScript](https://stackoverflow.com/questions/13093126/insert-unicode-character-into-javascript) – sebastian-ruehmann Jun 18 '20 at 19:25
  • Just write `'▶'`. (Or `'\u25B6'` if you don't trust your file encoding.) – Bergi Jun 18 '20 at 21:01
  • `\u25B6` looks like a Unicode code point (the actual index in the list of all Unicode characters). This is not UTF-8. The whole advantage of UTF-8 is that it uses the fewest bytes to encode every Unicode character, even the ones with one byte, or six or more bytes. – David Spector Jun 19 '20 at 20:36
  • @Bergi closed this question as a duplicate. He is wrong but gives me no way to reply to him. The question he cites has only Unicode code point answers (\u) and does not deal with UTF-8 at all. – David Spector Jun 19 '20 at 20:45
  • @DavidSpector You're replying just fine in the comments to me :-) Your question was "*How do we represent the string as a literal in JavaScript for use in the alert function?*", and the answer to that is to encode your script as UTF-8 and directly embed `'▶'`, as the duplicate target points out. It's unclear what else you are looking for. Maybe [edit] your question to clarify what you mean by "*we have a UTF-8 string*". Where do you have that and what do you want to do with it? – Bergi Jun 19 '20 at 21:07
  • "*strings can contain any bit patterns, but I don't think this is true for JavaScript.*" - not exactly. Strings in JavaScript are sequences of char codes, and *can* contain any bit pattern in their 16-bit values (UTF-16 code units). If you are looking for individual bytes, use a `Uint8Array`. – Bergi Jun 19 '20 at 21:11
  • My question asked for UTF-8 literals (constructed as usual by hex bytes). The selected answer does that perfectly. Your "duplicate question" has nothing to do with UTF-8. However, now that I know that ordinary JavaScript strings are allowed to contain any bit patterns, I can write a function to take a character string containing hex bytes as characters and turn it into a string of hex bytes. The \u notation is only useful for UTF-16 code points, which cannot express all of Unicode. Also, what I have in any UTF-8 file (for example) is bytes, not actual characters, which is why I can't use ▶. – David Spector Jun 20 '20 at 00:11
  • @Bergi, please remove the notice you added at the top. It is not relevant. – David Spector Jun 20 '20 at 00:21
  • @DavidSpector There is no such thing as a "UTF-8 literal" in JavaScript, and there are no strings containing individual bytes (unless you only use the LSB). "*what I have in any UTF-8 file is bytes*" - that's exactly what `▶` is. You just put these bytes into the JS source file. Btw, [ES6 introduced `\u{…}` escapes](https://mathiasbynens.be/notes/javascript-escapes#unicode-code-point) which can represent all Unicode code points. – Bergi Jun 20 '20 at 13:26
  • @DavidSpector I've edited the duplicate links to something more relevant for your use case – Bergi Jun 20 '20 at 13:27
  • They are NOT duplicates and NOT relevant. They both deal with converting one datatype to another, not with UTF-8 literals. A literal is a way to write a UTF-8 string in a JavaScript program without using UTF-8 in the program. There are several possible reasons why this could be needed. Please delete the "duplicate question" heading from this question, thanks. And yes, I know there is only a "Unicode" literal and no UTF-8 literal in JavaScript as yet. That is why I asked for a workaround and got it (see Answer 1). (Aside: why do we have to fight so hard when we use Stack Overflow?) – David Spector Jun 21 '20 at 14:53
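
The two approaches raised in the comments above (raw bytes in a `Uint8Array`, and ES6 `\u{…}` code point escapes) can be sketched roughly as follows. This is only an illustration: `hexToBytes` is a hypothetical helper name, the input is assumed to be an even-length string of valid hex digits, and `TextDecoder` is assumed to be available (it is in current browsers and recent Node.js versions).

```javascript
// Hypothetical helper (not a built-in): turns a hex string such as "E296B6"
// into a Uint8Array holding the individual bytes.
// Assumes the input is an even-length string of valid hex digits.
function hexToBytes(hex) {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(2 * i, 2 * i + 2), 16);
  }
  return bytes;
}

// Decode the bytes as UTF-8 into an ordinary JavaScript string.
const bytes = hexToBytes("E296B6");
const text = new TextDecoder("utf-8").decode(bytes);

console.log(text);             // ▶
console.log(text === "\u25B6"); // true

// The \u{…} escape takes a code point, including ones above U+FFFF,
// which are stored as two UTF-16 code units in the string.
console.log("\u{1F600}".length); // 2
```

The byte array keeps the UTF-8 bytes as data, while the decoded string holds the same character as UTF-16 code units, which is what `alert` or `console.log` expect.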

1 Answer


You could use `decodeURIComponent`, which decodes UTF-8 bytes written as two-digit hex codes, each prefixed with "%":

    console.log(decodeURIComponent("%E2%96%B6"));
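
To get something closer to PHP's `hex2bin('E296B6')`, the same call could be wrapped in a small helper that inserts the "%" prefixes itself. The following is a minimal sketch, assuming a hypothetical function name `hexToUtf8String` and that the input is a plain string of hex digits with no separators:

```javascript
// Hypothetical helper (the name is an assumption, not a built-in):
// converts a string of hex digits holding UTF-8 bytes into a JavaScript string.
function hexToUtf8String(hex) {
  // Accept only whole bytes, i.e. pairs of hex digits.
  if (!/^(?:[0-9a-fA-F]{2})*$/.test(hex)) {
    throw new TypeError("expected an even number of hex digits");
  }
  // Prefix every byte (two hex digits) with "%": "E296B6" -> "%E2%96%B6".
  const percentEncoded = hex.replace(/../g, "%$&");
  // decodeURIComponent reads the percent-encoded bytes as UTF-8 and
  // throws a URIError if they do not form valid UTF-8.
  return decodeURIComponent(percentEncoded);
}

console.log(hexToUtf8String("E296B6")); // ▶
```

The result is an ordinary JavaScript string, so it can be passed straight to `alert`.
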
trincot
  • Thanks! This is exactly what I'm looking for. This method will work with any UTF-8 encoded string as a literal. Perfect. Now all they have to do is build a simpler version right into a future version of JavaScript. – David Spector Jun 19 '20 at 20:39