
The answers from this question got me started on how to use an ArrayBuffer:

Converting between strings and ArrayBuffers

However, they take quite a few different approaches. The main one is this:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint16Array(buf));
}

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}

I wanted to clarify the difference between UTF-8 and UTF-16 encoding, though, because I'm not 100% sure this is correct.

So in JavaScript, in my understanding, all strings are UTF-16 encoded. But the raw bytes you might have in your own ArrayBuffer can be in any encoding.
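For example (just to illustrate that distinction, using "å" as a sample character):

// In a JS string, "å" is a single UTF-16 code unit:
'å'.length          // 1
'å'.charCodeAt(0)   // 229 (0x00E5)

// but the UTF-8 encoding of that same character is the two bytes 0xC3 0xA5,
// so an ArrayBuffer holding UTF-8 data would contain [195, 165] for it.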

So say that I have provided an ArrayBuffer to the browser from an XMLHttpRequest, and those bytes from the backend are in UTF-8 encoding:

var r = new XMLHttpRequest()
r.open('GET', '/x', true)
r.responseType = 'arraybuffer'   // ask for the raw bytes, not text
r.onload = function(){
  var b = r.response             // an ArrayBuffer
  if (!b) return
  var v = new Uint8Array(b)      // byte-level view of that buffer
}
r.send(null)

So now we have the ArrayBuffer b from the response r in the Uint8Array view v.

The question is, if I want to convert this into a JavaScript string, what to do.

From my understanding, the raw bytes we have in v are UTF-8 encoded (they were sent to the browser in UTF-8). If we were to do this, though, I don't think it would work right:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint16Array(buf));
}
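For example (a made-up illustration of the concern): the UTF-8 bytes for "hi" are 0x68 0x69, and viewing them through a Uint16Array pairs them into a single 16-bit code unit:

var utf8 = new Uint8Array([0x68, 0x69]);   // UTF-8 bytes for "hi"
var units = new Uint16Array(utf8.buffer);  // [0x6968] on a little-endian machine
String.fromCharCode.apply(null, units);    // "\u6968", one garbled character instead of "hi"
// (and if the buffer had an odd number of bytes, new Uint16Array(buf) would throw a RangeError)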

From my understanding, since the bytes are UTF-8 and JavaScript strings are UTF-16, you need to do this instead:

function ab2str(buf) {
  return String.fromCharCode.apply(null, new Uint8Array(buf));
}

So, using Uint8Array instead of Uint16Array. That is the first question: how to go from UTF-8 bytes to a JS string.

The second question is how to go back from a JavaScript string to UTF-8 bytes. That is, I am not sure this would encode correctly:

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint16Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}

I am not sure what to change in this one, though, to get back a UTF-8 ArrayBuffer. Something like this seems incorrect:

function str2ab(str) {
  var buf = new ArrayBuffer(str.length*2); // 2 bytes for each char
  var bufView = new Uint8Array(buf);
  for (var i=0, strLen=str.length; i<strLen; i++) {
    bufView[i] = str.charCodeAt(i);
  }
  return buf;
}
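For example (just an illustration of the truncation): any character whose UTF-16 code unit is above 255 loses its high byte when written into a Uint8Array:

'€'.charCodeAt(0)   // 8364 (0x20AC)

var view = new Uint8Array(1);
view[0] = '€'.charCodeAt(0);
view[0]             // 172 (0xAC): only the low byte survives
// the real UTF-8 encoding of "€" is the three bytes 0xE2 0x82 0xAC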

Anyway, I am just trying to clarify exactly how to go from UTF-8 bytes, which encode a string from the backend, to a UTF-16 JavaScript string on the frontend.

Lance
  • "*`String.fromCharCode.apply(null, new Uint8Array(buf))`*" - no, that only works for ASCII strings. You'll need a proper [`TextDecoder`](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder) (and a [`TextEncoder`](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder) for reversal). – Bergi Feb 04 '23 at 20:31

2 Answers


We need a few facts to understand what is happening:

1. JS uses UTF-16

First of all, JS uses UTF-16 to store characters, as mentioned in the "Unicode strings" section here: https://developer.mozilla.org/en-US/docs/Web/API/btoa
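A quick illustration you can run in a console:

// One emoji is a single Unicode character but two UTF-16 code units
'😀'.length          // 2
'😀'.charCodeAt(0)   // 55357 (0xD83D, high surrogate)
'😀'.charCodeAt(1)   // 56832 (0xDE00, low surrogate)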

2. UTF-16 and UTF-8

UTF-8 and UTF-16 do not mean that a character is represented by one byte or by two bytes, respectively. UTF-8, like UTF-16, is a variable-length encoding.
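For example, a quick sketch using TextEncoder (which always produces UTF-8) to show the byte counts:

const enc = new TextEncoder();
enc.encode('a').length    // 1 byte in UTF-8  (1 UTF-16 code unit in the string)
enc.encode('å').length    // 2 bytes in UTF-8 (1 code unit)
enc.encode('€').length    // 3 bytes in UTF-8 (1 code unit)
enc.encode('😀').length   // 4 bytes in UTF-8 (2 code units)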

3. ArrayBuffer and encodings

"hello" by one byte (Uint8Array): [104, 101, 108, 108, 111]
the same by two bytes (Uint16Array): [0, 104, 0, 101, 0, 108, 0, 108, 0, 111]

There is no encoding in ArrayBuffer because ArrayBuffer represent numbers.

Iteration over second array will be different from iteration over the first array. You know that the two-byte number cannot be pack onto one-byte number.
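A sketch of those two layouts (using Uint8Array.from / Uint16Array.from just for brevity):

// "hello" stored one byte per character
const one = Uint8Array.from('hello', c => c.charCodeAt(0));
// one -> [104, 101, 108, 108, 111], 5 bytes in total

// the same characters stored two bytes per character
const two = Uint16Array.from('hello', c => c.charCodeAt(0));
// two -> [104, 101, 108, 108, 111] as 16-bit values, 10 bytes in total
new Uint8Array(two.buffer)
// -> [104, 0, 101, 0, 108, 0, 108, 0, 111, 0] on a little-endian machine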


When you receive a response from the server in UTF-8, you receive it as a sequence of bytes. If every character in the data is stored as one byte, your code will work fine; that is the case for characters like [a-zA-Z0-9] and common punctuation. But if you receive a character that UTF-8 stores with two (or more) bytes, the transcription into UTF-16 will be incorrect:

0xC3 0xA5 (one character: "å") -> 0x00C3 0x00A5 (two UTF-16 code units: "Ã¥")

So if you never transfer characters outside the range of basic Latin letters, digits and punctuation, you can use your code and it will appear to work, even though it is not correct in general.
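A short demonstration of both cases:

const naiveAb2str = buf => String.fromCharCode.apply(null, new Uint8Array(buf));

// ASCII-only UTF-8 bytes: looks fine
naiveAb2str(new Uint8Array([104, 105]).buffer)    // "hi"

// UTF-8 bytes for "å": every byte becomes its own character
naiveAb2str(new Uint8Array([0xC3, 0xA5]).buffer)  // "Ã¥"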

za-ek

Why not use the TextDecoder interface instead of rolling your own? Are you limited to a browser that doesn't support it?

const decoder = new TextDecoder('UTF-8')
const dataStr = decoder.decode(dataBuf) 
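For the reverse direction (JavaScript string back to UTF-8 bytes), the matching TextEncoder interface works the same way; a minimal sketch:

const encoder = new TextEncoder()           // always encodes to UTF-8
const dataBytes = encoder.encode(dataStr)   // Uint8Array of UTF-8 bytes
const dataArrayBuffer = dataBytes.buffer    // the underlying ArrayBuffer, if one is needed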
Mark Reed