94

I have an ArrayBuffer which contains a string encoded using UTF-8, and I can't find a standard way of converting such an ArrayBuffer into a JS String (which I understand is encoded using UTF-16).

I've seen this code in numerous places, but I fail to see how it would work with any code points whose UTF-8 encoding is longer than 1 byte.

return String.fromCharCode.apply(null, new Uint8Array(data));

Similarly, I can't find a standard way of converting from a String to a UTF-8 encoded ArrayBuffer.

Tom Leese
  • @LightStyle Thanks, completely missed that spelling mistake! :P – Tom Leese Jun 19 '13 at 13:06
  • 1
    `var uintArray = new Uint8Array("string".split('').map(function(char) {return char.charCodeAt(0);}));` – Niccolò Campolungo Jun 19 '13 at 13:10
  • If that is what you need, I can explain it in an answer; otherwise I'll just keep the comment ;) – Niccolò Campolungo Jun 19 '13 at 13:16
  • Will that definitely work on UTF code points that are longer than 1 byte? – Tom Leese Jun 19 '13 at 13:19
  • I don't know, but it should, can't you try? – Niccolò Campolungo Jun 19 '13 at 13:21
  • I tried it with `new Uint8Array("h€l".split('').map(function(char) {return char.charCodeAt(0);}));` and it returned an array with 3 bytes. However, I believe it should be 5 bytes, because according to http://www.fileformat.info/info/unicode/char/20ac/index.htm the UTF-8 encoding of € is `0xE2 0x82 0xAC`. – Tom Leese Jun 19 '13 at 13:24
  • 8
    The one-liner you posted will decode bytes in the range 0x00–0xFF to their corresponding Unicode code points U+0000–U+00FF. In other words, it can’t represent anywhere near the whole Unicode range. However, it just so happens that Unicode code points U+0000–U+00FF correspond exactly to ISO 8859-1 (Latin 1), so what you have written is in effect an ISO 8859-1 decoder. LightStyle’s oneliner is the encoder that corresponds to the decoder in the question. In other words, it is an ISO 8859-1 encoder. – Daniel Cassidy Mar 24 '14 at 14:40
  • @TomLeese You fixed the spelling mistake and now I have no idea what it was :( – flarn2006 Nov 03 '17 at 19:22
  • Up-to-date answer here: https://stackoverflow.com/questions/6965107/converting-between-strings-and-arraybuffers – Tchakabam Apr 23 '20 at 12:24
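
To make the point from the comments above concrete, here is a small sketch that compares the one-liner from the question with a real UTF-8 decode of the same bytes. It assumes an environment that provides `TextDecoder`; the variable names are only for illustration.

```javascript
// "h€l" encoded as UTF-8: € (U+20AC) takes three bytes, 0xE2 0x82 0xAC.
const utf8Bytes = new Uint8Array([0x68, 0xE2, 0x82, 0xAC, 0x6C]);

// The one-liner maps every byte to one UTF-16 code unit (ISO 8859-1 behaviour).
const asLatin1 = String.fromCharCode.apply(null, utf8Bytes);

// A UTF-8 decoder combines the three bytes into a single code point.
const asUtf8 = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(asLatin1.length);  // 5
console.log(asUtf8.length);    // 3
console.log(asUtf8 === "h€l"); // true
```

The byte-per-character result only matches the UTF-8 result when every byte is below 0x80, which is why the one-liner appears to work for plain ASCII.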

8 Answers

105

Using TextEncoder and TextDecoder

var uint8array = new TextEncoder().encode("Plain Text"); // TextEncoder always produces UTF-8
var string = new TextDecoder("utf-8").decode(uint8array);
console.log(uint8array, string);
LWC
PPB
  • 10
    Support for this feature is [sorely lacking in IE and Edge](https://caniuse.com/#feat=textencoder). – Benproductions1 Nov 14 '17 at 06:58
  • And for some reason there is only a polyfill for TextEncoder, I'm assuming TextDecoding just simply wouldn't work in IE right now. – PeterS Mar 13 '19 at 16:56
  • Good answer, but using "Plain Text" is misleading: we aren't doing any cryptography here. encode != encrypt – Joseph Garrone Oct 04 '19 at 21:34
  • If you need IE support, you can use the [FastestSmallestTextEncoderDecoder polyfill](https://github.com/anonyco/FastestSmallestTextEncoderDecoder), recommended by the [MDN website](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder). – Rosberg Linhares Dec 05 '19 at 03:37
  • 5
    Notice that the TextEncoder constructor doesn't accept any argument (it's always UTF-8, no matter what you pass in). The decoder, however, does accept an encoding argument (both the documentation and the actual behavior agree on this). – MaMazav Jun 26 '20 at 11:32
  • 1
    @JosephGarrone "plain text" isn't a term that is restricted to cryptography... – Qix - MONICA WAS MISTREATED Jun 20 '21 at 15:05
  • 3
    For anyone coming across this question in 2021, every major browser supports TextEncoder/Decoder now: https://caniuse.com/textencoder – uryga Jul 04 '21 at 09:48
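
Since the question is about an ArrayBuffer rather than a Uint8Array, here is a minimal sketch of both directions built on the same two classes. The helper names `utf8BufferToString` and `stringToUtf8Buffer` are made up for this example, and it assumes `TextEncoder` and `TextDecoder` are available in your environment.

```javascript
// Hypothetical helper names; both rely on the built-in TextEncoder/TextDecoder.

// Decode an ArrayBuffer (or a typed array / DataView) holding UTF-8 bytes.
function utf8BufferToString(buffer) {
  return new TextDecoder("utf-8").decode(buffer);
}

// Encode a JS string as UTF-8 and return a standalone ArrayBuffer.
function stringToUtf8Buffer(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array
  // Copy exactly the encoded range so the result never exposes extra bytes.
  return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

const buffer = stringToUtf8Buffer("h€l");
console.log(buffer.byteLength);          // 5
console.log(utf8BufferToString(buffer)); // "h€l"
```

`decode` accepts the ArrayBuffer directly, so no manual `new Uint8Array(...)` wrapper is needed on the decoding side.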
46
function stringToUint(string) {
    var string = btoa(unescape(encodeURIComponent(string))),
        charList = string.split(''),
        uintArray = [];
    for (var i = 0; i < charList.length; i++) {
        uintArray.push(charList[i].charCodeAt(0));
    }
    return new Uint8Array(uintArray);
}

function uintToString(uintArray) {
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(atob(encodedString)));
    return decodedString;
}

With some help from the internet, I wrote these little functions; they should solve your problem. Here is the working JSFiddle.

EDIT:

Since the source of the Uint8Array is external and you can't use atob, you just need to remove it (working fiddle):

function uintToString(uintArray) {
    var encodedString = String.fromCharCode.apply(null, uintArray),
        decodedString = decodeURIComponent(escape(encodedString));
    return decodedString;
}

Warning: escape and unescape are deprecated and are no longer recommended for new code. See this.

Anna
Niccolò Campolungo
  • 1
    `atob`/`btoa` do base64 encoding/decoding; if you pass an honest UTF-8 byte array, it won't work: http://jsfiddle.net/Z9pQE/1/ – Esailija Jun 19 '13 at 13:46
  • 1
    It is meant to work only with a UintArray of an encoded string; otherwise it won't work because of the `btoa` and `atob` conversion. – Niccolò Campolungo Jun 19 '13 at 13:47
  • I probably should've specified, but the UTF-8 string in the `ArrayBuffer` comes from a separate program written in a different programming language which produces pure UTF-8 strings, so, as Esailija said, I can't use this as it does base64 encoding. – Tom Leese Jun 19 '13 at 13:49
  • Wait. You can easily use this if the source is external, just don't use `atob` function. I'm going to update this with a new fiddle, just 1 minute – Niccolò Campolungo Jun 19 '13 at 13:51
  • 2
    Done. The same is true for the `stringToUint` function, just remove the `btoa` function and you're done :) – Niccolò Campolungo Jun 19 '13 at 13:55
  • You're welcome! Anyway, @Esailija your solution is great, worth +1! :D – Niccolò Campolungo Jun 19 '13 at 13:57
  • 2
    You saved my day! Just one addition, that if you use it with huge arrays, you can easily get: `[Error] RangeError: Maximum call stack size exceeded.` To fix that I use `.slice()` and apply it in chunks – Pengő Dzsó Feb 14 '14 at 18:32
  • Glad to help! Feel free to edit the answer and add your solution :) – Niccolò Campolungo Feb 14 '14 at 21:27
  • why the [`btoa()`](https://developer.mozilla.org/en-US/docs/Web/API/WindowBase64/btoa) call in `stringToUint()`? To me that's completely wrong and reducing that line to `var string = unescape(encodeURIComponent(string));` works better for me. – Udo G Apr 23 '15 at 12:27
  • Just something that should be noted: If your array is sufficiently large, this solution will cause a stack overflow on the call to String.fromCharCode.apply. For some solutions, a loop may be better. – aeskreis Jul 28 '16 at 16:36
  • 2
    This answer is outdated, go here: https://stackoverflow.com/questions/6965107/converting-between-strings-and-arraybuffers – Tchakabam Apr 23 '20 at 12:23
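
Two suggestions from the comments above can be combined into one sketch: dropping the base64 step, and calling `String.fromCharCode` on chunks so that large arrays don't exceed the call stack. The function names and the chunk size of 0x8000 are assumptions chosen for this example, not part of the original answer, and the code still depends on the deprecated `escape`/`unescape`.

```javascript
// Hypothetical names. escape/unescape are deprecated; they are used here only
// because the answer above is built on them.

// String -> UTF-8 bytes, without the base64 step.
function stringToUintNoBase64(str) {
  const binary = unescape(encodeURIComponent(str)); // one char per UTF-8 byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// UTF-8 bytes (Uint8Array) -> string, calling fromCharCode on one chunk at a time.
function uintToStringChunked(uintArray, chunkSize = 0x8000) {
  let binary = "";
  for (let i = 0; i < uintArray.length; i += chunkSize) {
    // subarray creates a view, so no bytes are copied here.
    binary += String.fromCharCode.apply(null, uintArray.subarray(i, i + chunkSize));
  }
  // The UTF-8 decoding happens once, on the fully assembled byte string.
  return decodeURIComponent(escape(binary));
}

console.log(uintToStringChunked(stringToUintNoBase64("h€l"))); // "h€l"
```

Because the decoding step runs after all chunks are joined, a multi-byte sequence that straddles a chunk boundary is still decoded correctly.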
29

This should work:

// http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt

/* utf.js - UTF-8 <=> UTF-16 convertion
 *
 * Copyright (C) 1999 Masanao Izumo <iz@onicos.co.jp>
 * Version: 1.0
 * LastModified: Dec 25 1999
 * This library is free.  You can redistribute it and/or modify it.
 */

function Utf8ArrayToStr(array) {
  var out, i, len, c;
  var char2, char3;

  out = "";
  len = array.length;
  i = 0;
  while (i < len) {
    c = array[i++];
    switch (c >> 4)
    { 
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += String.fromCharCode(c);
        break;
      case 12: case 13:
        // 110x xxxx   10xx xxxx
        char2 = array[i++];
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx  10xx xxxx  10xx xxxx
        char2 = array[i++];
        char3 = array[i++];
        out += String.fromCharCode(((c & 0x0F) << 12) |
                                   ((char2 & 0x3F) << 6) |
                                   ((char3 & 0x3F) << 0));
        break;
    }
  }    
  return out;
}

It's somewhat cleaner than the other solutions because it doesn't use any hacks or depend on browser-specific JS functions, so it also works in other JS environments.

Check out the JSFiddle demo.

Also see the related questions: here, here

Will
Albert
  • 5
    What about when going from string to utf-8 buffer? – Sámal Rasmussen May 24 '17 at 11:14
  • This is the least readable code I've ever seen to implement char-code to string conversion. I appreciate and admire the effort put into it, but there's 100s of more maintainable ways to achieve that. – ANTARA Aug 31 '23 at 20:27
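
For the opposite direction asked about in the comments (string to UTF-8 bytes), a hand-written encoder in the same dependency-free spirit could look like the sketch below. It is not part of the original utf.js; the name `strToUtf8Array` is made up, and replacing lone surrogates with U+FFFD is a design choice made for this example.

```javascript
// Hypothetical counterpart to Utf8ArrayToStr: JS string -> Uint8Array of UTF-8 bytes.
function strToUtf8Array(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.codePointAt(i);
    if (cp > 0xFFFF) i++;                          // a surrogate pair used two code units
    if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD; // lone surrogate: substitute U+FFFD

    if (cp < 0x80) {
      out.push(cp);                                // 0xxxxxxx
    } else if (cp < 0x800) {
      out.push(0xC0 | (cp >> 6),                   // 110xxxxx
               0x80 | (cp & 0x3F));                // 10xxxxxx
    } else if (cp < 0x10000) {
      out.push(0xE0 | (cp >> 12),                  // 1110xxxx
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
    } else {
      out.push(0xF0 | (cp >> 18),                  // 11110xxx
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F));
    }
  }
  return new Uint8Array(out);
}

console.log(strToUtf8Array("h€l")); // bytes 0x68 0xE2 0x82 0xAC 0x6C
```

Note that this encoder also emits four-byte sequences for characters above U+FFFF, which the decoder shown above does not read back.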
23

There's a polyfill for Encoding over on GitHub: text-encoding. It's easy to use in Node or the browser, and the README advises the following:

var uint8array = new TextEncoder(encoding).encode(string);
var string = new TextDecoder(encoding).decode(uint8array);

If I recall, 'utf-8' is the encoding you need, and of course you'll need to wrap your buffer:

var uint8array = new Uint8Array(utf8buffer);

Hope it works as well for you as it has for me.
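
If the UTF-8 text is only part of a larger buffer, the wrapping step can also select a byte range. The sketch below assumes `TextDecoder` is available (natively or through a polyfill that follows the same interface); the helper name `decodeUtf8Range` and the buffer layout are invented for illustration.

```javascript
// Hypothetical helper: decode `byteLength` bytes of UTF-8 starting at `byteOffset`.
function decodeUtf8Range(buffer, byteOffset, byteLength) {
  // The Uint8Array is a view onto the ArrayBuffer, not a copy of it.
  const view = new Uint8Array(buffer, byteOffset, byteLength);
  return new TextDecoder("utf-8").decode(view);
}

// Example layout (made up for illustration): 2 header bytes, then "h€l".
const message = new Uint8Array([0x00, 0x05, 0x68, 0xE2, 0x82, 0xAC, 0x6C]).buffer;
console.log(decodeUtf8Range(message, 2, 5)); // "h€l"
```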

popham
13

If you are doing this in the browser, there are no character encoding libraries built in, but you can get by with:

function pad(n) {
    return n.length < 2 ? "0" + n : n;
}

var array = new Uint8Array(data);
var str = "";
for (var i = 0, len = array.length; i < len; ++i) {
    str += "%" + pad(array[i].toString(16));
}

str = decodeURIComponent(str);

Here's a demo that decodes a 3-byte UTF-8 sequence: http://jsfiddle.net/Z9pQE/
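
The same percent-encoding trick can be run in reverse to get from a string to UTF-8 bytes. This is only a sketch under the assumption that the input contains no lone surrogates (`encodeURIComponent` throws a `URIError` for those); the name `stringToUtf8Bytes` is hypothetical.

```javascript
// Hypothetical helper: JS string -> Uint8Array of UTF-8 bytes via encodeURIComponent.
function stringToUtf8Bytes(str) {
  const encoded = encodeURIComponent(str); // non-ASCII becomes %XX per UTF-8 byte
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    if (encoded[i] === "%") {
      // Two hex digits follow the percent sign.
      bytes.push(parseInt(encoded.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are always ASCII, so the code unit is the byte.
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return new Uint8Array(bytes);
}

console.log(stringToUtf8Bytes("h€l")); // bytes 0x68 0xE2 0x82 0xAC 0x6C
```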

Esailija
3

The methods readAsArrayBuffer and readAsText of a FileReader object asynchronously convert a Blob object to an ArrayBuffer or to a DOMString.

A Blob object can be created from raw text or a byte array, for example:

let blob = new Blob([text], { type: "text/plain" });

let reader = new FileReader();
reader.onload = event =>
{
    let buffer = event.target.result;
};
reader.readAsArrayBuffer(blob);

I think it's better to wrap this in a promise-like helper with a done callback:

function textToByteArray(text)
{
    let blob = new Blob([text], { type: "text/plain" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(new Uint8Array(event.target.result));
    };
    reader.readAsArrayBuffer(blob);

    return { done: function(callback) { done = callback; } }
}

function byteArrayToText(bytes, encoding)
{
    let blob = new Blob([bytes], { type: "application/octet-stream" });
    let reader = new FileReader();
    let done = function() { };

    reader.onload = event =>
    {
        done(event.target.result);
    };

    if(encoding) { reader.readAsText(blob, encoding); } else { reader.readAsText(blob); }

    return { done: function(callback) { done = callback; } }
}

let text = "\uD83D\uDCA9 = \u2661";
textToByteArray(text).done(bytes =>
{
    console.log(bytes);
    byteArrayToText(bytes, 'UTF-8').done(text => 
    {
        console.log(text); // 💩 = ♡
    });
});
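
Where `Blob.prototype.arrayBuffer()` and `Blob.prototype.text()` are available (an assumption to check against your target browsers or runtime), the same idea can be written with real promises and no FileReader. The names `textToBytes` and `bytesToText` are hypothetical, and unlike `readAsText` this variant cannot take an encoding argument.

```javascript
// Hypothetical helper names. Assumes Blob.prototype.arrayBuffer() and
// Blob.prototype.text() exist in the target environment.

// String -> Promise<Uint8Array>; a Blob stores string parts as UTF-8.
async function textToBytes(text) {
  const blob = new Blob([text], { type: "text/plain" });
  return new Uint8Array(await blob.arrayBuffer());
}

// Uint8Array -> Promise<string>; Blob.text() always decodes as UTF-8.
async function bytesToText(bytes) {
  const blob = new Blob([bytes], { type: "application/octet-stream" });
  return blob.text();
}

textToBytes("\uD83D\uDCA9 = \u2661")
  .then(bytes => {
    console.log(bytes.length); // 10
    return bytesToText(bytes);
  })
  .then(text => console.log(text)); // 💩 = ♡
```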
Martin Wantke
3

If you don't want to use any external polyfill library, you can use this function provided by the Mozilla Developer Network website:

function utf8ArrayToString(aBytes) {
    var sView = "";
    
    for (var nPart, nLen = aBytes.length, nIdx = 0; nIdx < nLen; nIdx++) {
        nPart = aBytes[nIdx];
        
        sView += String.fromCharCode(
            nPart > 251 && nPart < 254 && nIdx + 5 < nLen ? /* six bytes */
                /* (nPart - 252 << 30) may be not so safe in ECMAScript! So...: */
                (nPart - 252) * 1073741824 + (aBytes[++nIdx] - 128 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 247 && nPart < 252 && nIdx + 4 < nLen ? /* five bytes */
                (nPart - 248 << 24) + (aBytes[++nIdx] - 128 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 239 && nPart < 248 && nIdx + 3 < nLen ? /* four bytes */
                (nPart - 240 << 18) + (aBytes[++nIdx] - 128 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 223 && nPart < 240 && nIdx + 2 < nLen ? /* three bytes */
                (nPart - 224 << 12) + (aBytes[++nIdx] - 128 << 6) + aBytes[++nIdx] - 128
            : nPart > 191 && nPart < 224 && nIdx + 1 < nLen ? /* two bytes */
                (nPart - 192 << 6) + aBytes[++nIdx] - 128
            : /* nPart < 127 ? */ /* one byte */
                nPart
        );
    }
    
    return sView;
}

let str = utf8ArrayToString([50,72,226,130,130,32,43,32,79,226,130,130,32,226,135,140,32,50,72,226,130,130,79]);

// Must show 2H₂ + O₂ ⇌ 2H₂O
console.log(str);
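
One caveat worth checking before relying on the four-byte branch: `String.fromCharCode` keeps only the low 16 bits of its argument, so a code point above U+FFFF comes out as a different character. `String.fromCodePoint` (ES2015) is one way around that. A minimal illustration, independent of the function above:

```javascript
const codePoint = 0x1F4A9; // U+1F4A9, encoded in UTF-8 as F0 9F 92 A9

// fromCharCode truncates to 16 bits: 0x1F4A9 & 0xFFFF === 0xF4A9.
const truncated = String.fromCharCode(codePoint);

// fromCodePoint produces the surrogate pair that UTF-16 requires.
const correct = String.fromCodePoint(codePoint);

console.log(truncated === "\uF4A9");      // true
console.log(correct === "\uD83D\uDCA9");  // true
console.log(correct.length);              // 2
```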
Rosberg Linhares
  • 1
    see up-to-date answer: https://stackoverflow.com/questions/6965107/converting-between-strings-and-arraybuffers – Tchakabam Apr 23 '20 at 12:23
1

The main difficulty for programmers converting a byte array into a string is the UTF-8 encoding of Unicode characters, where a single character can occupy several bytes. This code will help you:

var getString = function (strBytes) {

    var MAX_SIZE = 0x4000;
    var codeUnits = [];
    var highSurrogate;
    var lowSurrogate;
    var index = -1;

    var result = '';

    while (++index < strBytes.length) {
        var codePoint = Number(strBytes[index]);

        if (codePoint === (codePoint & 0x7F)) {

        } else if (0xF0 === (codePoint & 0xF0)) {
            codePoint ^= 0xF0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xE0 === (codePoint & 0xE0)) {
            codePoint ^= 0xE0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        } else if (0xC0 === (codePoint & 0xC0)) {
            codePoint ^= 0xC0;
            codePoint = (codePoint << 6) | (strBytes[++index] ^ 0x80);
        }

        if (!isFinite(codePoint) || codePoint < 0 || codePoint > 0x10FFFF || Math.floor(codePoint) != codePoint)
            throw RangeError('Invalid code point: ' + codePoint);

        if (codePoint <= 0xFFFF)
            codeUnits.push(codePoint);
        else {
            codePoint -= 0x10000;
            highSurrogate = (codePoint >> 10) | 0xD800;
            lowSurrogate = (codePoint % 0x400) | 0xDC00;
            codeUnits.push(highSurrogate, lowSurrogate);
        }
        if (index + 1 == strBytes.length || codeUnits.length > MAX_SIZE) {
            result += String.fromCharCode.apply(null, codeUnits);
            codeUnits.length = 0;
        }
    }

    return result;
}

All the best !

konak
  • That's not complete. For example, German umlauts are missing! – Adrian Preuss Jan 19 '18 at 13:45
  • By the way, I noticed that the if statements were in the wrong order. Maybe that is why your string was not processed. I had corrected it in my own code but forgot to correct it in this post. – konak Jan 20 '18 at 14:26
  • 1
    `ö` = `RangeError: Invalid code point: 1581184`, `ü` = `RangeError: Invalid code point: 3678336` – Adrian Preuss Jan 21 '18 at 07:48
  • I have changed the code above; please try it once more. There was a problem with the ordering of the "else if" statements. Now it should work for your case too. The code was tested with more than 30 languages, including Japanese, Korean, and Arabic. – konak Jan 21 '18 at 09:03
  • For example here are words I have transferred using bytes and restored to string in Javascript: Hälfte, Über, – konak Jan 21 '18 at 09:10
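
The ordering problem discussed in these comments comes from tests such as `(byte & 0xC0) === 0xC0`, which also match three- and four-byte lead bytes. One way to avoid depending on the order is to compare against the exact prefix for each length, as in this sketch. The function name `utf8SequenceLength` is made up for illustration and is not part of the answer's code.

```javascript
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with
// `lead`, or 0 if `lead` is a continuation byte or not a valid lead byte.
function utf8SequenceLength(lead) {
  if ((lead & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((lead & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((lead & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((lead & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;                             // 10xxxxxx or 11111xxx
}

console.log(utf8SequenceLength(0xC3)); // 2 (lead byte of "ö", C3 B6)
console.log(utf8SequenceLength(0xE2)); // 3 (lead byte of "€", E2 82 AC)
```

Each mask covers the marker bits plus the terminating zero bit, so no lead byte can match more than one branch.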