UTF-16 Byte Array
JavaScript encodes strings as UTF-16, just like C#'s UnicodeEncoding, so creating a byte array is relatively straightforward. JavaScript's charCodeAt() returns a 16-bit code unit (that is, an integer between 0 and 65535 that fits in 2 bytes). You can split it into distinct bytes using the following:
function strToUtf16Bytes(str) {
  const bytes = [];
  for (let ii = 0; ii < str.length; ii++) {
    const code = str.charCodeAt(ii); // 0x0000-0xFFFF
    bytes.push(code & 255, code >> 8); // low byte first, then high byte
  }
  return bytes;
}
For example:
strToUtf16Bytes('🌵');
// [ 60, 216, 53, 223 ]
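To confirm the byte layout, here is a hypothetical inverse helper (my own addition, not from the original post) that rebuilds the string from the low/high byte pairs:
// Hypothetical inverse of strToUtf16Bytes, shown only to verify the byte order.
function utf16BytesToStr(bytes) {
  let str = '';
  for (let ii = 0; ii < bytes.length; ii += 2) {
    // Reassemble each 16-bit code unit: low byte first, then high byte.
    str += String.fromCharCode(bytes[ii] | (bytes[ii + 1] << 8));
  }
  return str;
}
utf16BytesToStr(strToUtf16Bytes('🌵')); // '🌵'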
This works between C# and JavaScript because they both support UTF-16. However, if you want to get a UTF-8 byte array from JS, you must transcode the bytes.
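As an illustrative comparison (not from the original post), the same character produces different bytes under each encoding. For example, U+00E9 ('é'):
strToUtf16Bytes('é'); // [ 233, 0 ]  one 16-bit code unit, low byte first
// The UTF-8 encoding of 'é' is [ 195, 169 ] (0xC3 0xA9), which is what the
// converter in the next section produces.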
UTF-8 Byte Array
The solution feels somewhat non-trivial, but I used the code below in production with great success (original source).
Also, for the interested reader, I've published my unicode helpers, which I use to work with string lengths reported by other languages such as PHP.
/**
 * Convert a string to a UTF-8 byte array
 * @param {string} str
 * @return {Array} of bytes
 */
export function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) {
      // 1 byte: ASCII range
      utf8.push(charCode);
    } else if (charCode < 0x800) {
      // 2 bytes
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      // 3 bytes
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      // 4 bytes: surrogate pair (assumes a valid low surrogate follows).
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      ii++;
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}
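A quick usage check (my own addition, not from the original source): the cactus emoji from earlier encodes to four UTF-8 bytes, and the built-in TextEncoder API available in modern browsers and Node produces the same result.
strToUtf8Bytes('🌵');
// [ 240, 159, 140, 181 ]  (0xF0 0x9F 0x8C 0xB5)

// For comparison, TextEncoder returns the same bytes as a Uint8Array:
Array.from(new TextEncoder().encode('🌵'));
// [ 240, 159, 140, 181 ]
Note the length mismatch: '🌵'.length is 2 (UTF-16 code units), but the UTF-8 array holds 4 bytes, which is the kind of discrepancy you hit when comparing lengths with byte-counting languages like PHP.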