14

I have this string in java:

"test.message"

String plaintext = "test.message";
byte[] bytes = plaintext.getBytes("UTF-8");
//result: [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]

If I do the same thing in javascript:

    stringToByteArray: function (str) {         
        str = unescape(encodeURIComponent(str));

        var bytes = new Array(str.length);
        for (var i = 0; i < str.length; ++i)
            bytes[i] = str.charCodeAt(i);

        return bytes;
    },

I get:

 [7,163,140,72,178,72,244,241,149,43,67,124]

I was under the impression that the unescape(encodeURIComponent()) would correctly translate the string to UTF-8. Is this not the case?

Reference:

http://ecmanaut.blogspot.be/2006/07/encoding-decoding-utf8-in-javascript.html

2 Answers

19

You can use TextEncoder which is part of the Encoding Living Standard. According to the Encoding API entry from the Chromium Dashboard, it shipped in Firefox and will ship in Chrome 38. There is also a text-encoding polyfill available.

The JavaScript code sample below returns a Uint8Array filled with the values you expect.

var s = "test.message";
var encoder = new TextEncoder();
encoder.encode(s);
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]
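
For non-ASCII input the same call returns several bytes per character. The sketch below is only an illustration: the helper name `utf8Bytes` is made up here, and it assumes an environment where `TextEncoder` is available as a global (current browsers and recent Node.js versions).

```javascript
// Hypothetical helper (not from the answer): UTF-8 encode a string
// and return the bytes as a plain Array instead of a Uint8Array.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

console.log(utf8Bytes("test.message"));
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]

console.log(utf8Bytes("\u4e2d\u6587")); // the two characters "中文"
// [228, 184, 173, 230, 150, 135]
```

Each of the two Chinese characters takes three bytes in UTF-8, which is what Java's `getBytes("UTF-8")` would produce for the same string.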
Kevin Hakanson
  • And, then to get the total bytes, like Java's `.getBytes()`? Add values in array? i.e. `Array.from(new TextEncoder().encode('some delicious cookie')).reduce((acc, current) => acc + current, 0)` – Neil Gaetano Lindberg Jun 04 '21 at 19:02
  • This answer is from 2014 and should be updated to note that a polyfill is no longer needed and the api is supported on all current browsers: https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder – dcow Oct 12 '21 at 18:51
10

JavaScript has no concept of character encoding for String; every string is a sequence of UTF-16 code units. For ASCII characters the value of a code unit in UTF-16 is the same as the byte value in UTF-8, so for ASCII-only text you can forget it's any different.

There are more efficient ways to do this, but for example:

function s(x) {return x.charCodeAt(0);}
"test.message".split('').map(s);
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]
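
To see where this shortcut stops matching UTF-8, the following sketch compares code unit values with UTF-8 bytes for one non-ASCII character. The function name `codeUnits` is an illustrative choice, and `TextEncoder` is assumed to be available and is used only as a reference for the UTF-8 bytes.

```javascript
// Illustrative helper: the UTF-16 code unit value of every element of the string.
function codeUnits(str) {
  return str.split('').map(function (ch) { return ch.charCodeAt(0); });
}

var eAcute = "\u00e9"; // the character "é"

console.log(codeUnits(eAcute));
// [233]  (one UTF-16 code unit)

console.log(Array.from(new TextEncoder().encode(eAcute)));
// [195, 169]  (two UTF-8 bytes)
```

So the `charCodeAt` approach gives UTF-8 bytes only while every character is ASCII.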

So what is unescape(encodeURIComponent(str)) doing? Let's look at each individually:

  1. encodeURIComponent converts every character in str which is illegal or has a meaning in URI Syntax into a URI escaped version so that there is no problem using it as a key or value in the search component of a URI, for example encodeURIComponent('&='); // "%26%3D" Notice how this is now a 6 character long String. Non-ASCII characters are first encoded as UTF-8, and each resulting byte is written as its own %XX escape.
  2. unescape is actually deprecated, but it does a similar job to decodeURI or decodeURIComponent (the reverse of encodeURIComponent). If we look in the ES5 spec we can see 11. Let c be the character whose code unit value is the integer represented by the four hexadecimal digits at positions k+2, k+3, k+4, and k+5 within Result(1).
    That step covers the %uXXXX form; the two-digit %XX escapes produced by encodeURIComponent are handled the same way, so each escaped UTF-8 byte becomes one character whose code unit value is that byte. As I mentioned, all Strings are UTF-16, so the result is really a UTF-16 string in which every code unit holds one UTF-8 byte.
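
The two steps can be traced with a short sketch. It reuses the shape of the question's `stringToByteArray` but keeps the intermediate values in named variables (`escaped` and `binary` are names chosen here for illustration). It assumes the deprecated global `unescape` is still present, as it is in browsers and Node.js.

```javascript
// Sketch of the question's function with the intermediate values exposed.
function stringToByteArray(str) {
  var escaped = encodeURIComponent(str); // UTF-8 bytes written as %XX escapes
  var binary = unescape(escaped);        // one character per UTF-8 byte
  var bytes = new Array(binary.length);
  for (var i = 0; i < binary.length; ++i) {
    bytes[i] = binary.charCodeAt(i);     // each value is in the range 0..255
  }
  return bytes;
}

console.log(encodeURIComponent("\u00e9")); // "%C3%A9"
console.log(stringToByteArray("\u00e9"));  // [195, 169]
console.log(stringToByteArray("test.message"));
// [116, 101, 115, 116, 46, 109, 101, 115, 115, 97, 103, 101]
```

Note that encodeURIComponent throws a URIError for strings containing a lone surrogate, so this approach only works on well-formed input.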
Paul S.
  • I cannot forget it's any different as I need support for Chinese. –  Apr 04 '14 at 11:48
  • btw if you read this they suggest unescape(encodeURIComponent()) to get utf8 value from utf16: http://ecmanaut.blogspot.be/2006/07/encoding-decoding-utf8-in-javascript.html –  Apr 04 '14 at 11:49
  • So, is there a solution? –  Apr 04 '14 at 11:59
  • @Wesley I should have actually tested your code; I can't actually reproduce the "wrong" result you got, I get the same as you expected, and when I try to reverse your weird output I get `"£H²Hôñ+C|"` – Paul S. Apr 04 '14 at 12:02
  • Are you serving the page as _UTF-8_? I'm starting to think maybe you're serving the page in a different character encoding which doesn't support all your characters and then want to convert the malformed strings in that into _UTF-8_. (This will be exceedingly difficult as the browser does a _Stream -> String (in Stream's encoding) -> UTF-16_ conversion before _JavaScript_ sees it.) – Paul S. Apr 04 '14 at 12:08
  • Thanks, that was it. Headers were being overwritten. –  Apr 04 '14 at 12:44
  • Incorrect, JavaScript spec uses UCS-2 which is similar to UTF-16 but does not behave the same all the time. See https://mathiasbynens.be/notes/javascript-encoding and https://mathiasbynens.be/notes/javascript-unicode for excellent discourses on the matter – PixnBits Nov 06 '14 at 20:31