
I need to get an ArrayBuffer from an HTTP request that sends me a base64 answer. For this request, I can't use XMLHttpRequest.responseType="arraybuffer".

The response is read through xhr.responseText, so it arrives as a DOMString. I'm trying to turn it back into an ArrayBuffer.

I've tried to get back to the base64 from the DOMString using btoa(mysString) or window.btoa(unescape(encodeURIComponent(str))). The first option simply fails, while the second doesn't produce the same base64. Here are the first few characters of each base64 string:

Incoming : UEsDBBQACAgIACp750oAAAAAAAAAAAAAAAALAAAAX3JlbHMvLnJlbH

After the second processing: UEsDBBQACAgIAO+/ve+/ve+/vUoAAAAAAAAAAAAAAAALAAAAX3JlbHMvLnJlbH

As you can see, part of it is similar, but some parts are way off. What am I missing to get it right?

Py.
  • Possible duplicate of https://stackoverflow.com/questions/21797299/convert-base64-string-to-arraybuffer The first answer should work – Cornelius Fillmore Jul 07 '17 at 15:12
  • Tried it and it doesn't work. Anyway, the question you linked to is starting from the base 64 string and converting it to an arraybuffer. I don't have that base 64 string right now. – Py. Jul 07 '17 at 20:40

1 Answer


I had the same issue.

The solution (tested in Chrome 68.0.3440.84):

let url = '';

// ISO-8859-15 differs from ISO-8859-1 at eight positions.
// Map each of those decoded code points back to its original byte value.
let iso_8859_15_table = { 338: 188, 339: 189, 352: 166, 353: 168, 376: 190, 381: 180, 382: 184, 8364: 164 };

function iso_8859_15_to_uint8array(iso_8859_15_str) {
    let buf = new ArrayBuffer(iso_8859_15_str.length);
    let bufView = new Uint8Array(buf);
    for (let i = 0, strLen = iso_8859_15_str.length; i < strLen; i++) {
        let octet = iso_8859_15_str.charCodeAt(i);
        if (iso_8859_15_table.hasOwnProperty(octet))
            octet = iso_8859_15_table[octet];
        // Check the range before storing: a Uint8Array silently wraps out-of-range values.
        if (octet < 0 || 255 < octet)
            console.error(`invalid data error at index ${i}`);
        bufView[i] = octet;
    }
    return bufView;
}

let req = new XMLHttpRequest();
req.open("get", url);
// Force the response to be decoded as ISO-8859-15 (must be called before send()).
req.overrideMimeType('text/plain; charset=ISO-8859-15');
req.onload = () => {
    console.log(`Uint8Array : `);
    let uint8array = iso_8859_15_to_uint8array(req.responseText);
    console.log(uint8array);
    // uint8array.buffer is the underlying ArrayBuffer.
};
req.send();

Below is an explanation of what I learned while solving it.

Explanation

Why are some parts way off?

Because the text decoder causes data loss (in your case, the decoder is UTF-8).

Take UTF-8 as an example:

  • It is a variable-width character encoding for Unicode.

  • It has rules (this is what causes the problem) that follow from its variable-length design, ASCII compatibility, and so on.

  • So a decoder may replace a non-conforming byte sequence with a replacement character such as U+003F (?, question mark) or U+FFFD (�, Unicode replacement character).

  • With UTF-8, byte values 0~127 are stable and 128~255 are unstable: any byte in 128~255 that is not part of a valid multi-byte sequence is converted to U+FFFD.
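
To make the data loss concrete, here is a minimal sketch (not part of the original answer) that decodes a few bytes as UTF-8 and then encodes the result again. The sample bytes and the helper names utf8RoundTrip and toHex are made up for illustration. The lone 0xE7 byte should come back as EF BF BD, the UTF-8 encoding of U+FFFD, which is what appears in base64 as the repeated "+/ve" runs in the question's second string.

```javascript
// Hypothetical sample: ASCII 'P', 'K', then a lone non-ASCII byte, then ASCII 'J'.
const original = new Uint8Array([0x50, 0x4b, 0xe7, 0x4a]);

// Decode as UTF-8 (lossy for invalid sequences), then encode the text back to bytes.
function utf8RoundTrip(bytes) {
    const text = new TextDecoder('utf-8').decode(bytes);
    return new TextEncoder().encode(text);
}

// Format bytes as space-separated lowercase hex for easy comparison.
const toHex = (bytes) =>
    Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

const roundTripped = utf8RoundTrip(original);
console.log(toHex(original));     // the bytes that were sent
console.log(toHex(roundTripped)); // the bytes after the lossy round trip
```

The round-tripped array is longer than the original, and nothing in it says which byte the replacement character used to be. That is why re-encoding responseText cannot restore the data once it has been decoded as UTF-8.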

Are text decoders other than UTF-8 safe?

Not in general. Most of them have rules of their own that lose data.

UTF-8 is unrecoverable too, because many different byte values in 128~255 all collapse to U+FFFD.

If every byte value decodes to a distinct character (a one-to-one mapping), the binary data can be recovered.

How to solve it?

  1. Find a recoverable text decoder.
  2. Force the MIME type of the incoming data to that recoverable charset: xhr_object.overrideMimeType('text/plain; charset=ISO-8859-15')
  3. When the response arrives, recover the binary data from the string with a recovery table.

Find recoverable text decoders.

For a decoder to be recoverable, no two byte values may decode to the same character.

The following code is a simple example. It only tests single bytes (a Uint8Array holding the values 0 to 255), so it may miss some recoverable text decoders.

let bufferView = new Uint8Array(256);
for (let i = 0; i < 256; i++)
    bufferView[i] = i;

let recoverable = []
let decoding = ['utf-8', 'ibm866', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-8i', 'iso-8859-10', 'iso-8859-13', 'iso-8859-14', 'iso-8859-15', 'iso-8859-16', 'koi8-r', 'koi8-u', 'macintosh', 'windows-874', 'windows-1250', 'windows-1251', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', 'x-mac-cyrillic', 'gbk', 'gb18030', 'hz-gb-2312', 'big5', 'euc-jp', 'iso-2022-jp', 'shift-jis', 'euc-kr', 'iso-2022-kr', 'utf-16be', 'utf-16le', 'x-user-defined', 'ISO-2022-CN', 'ISO-2022-CN-ext']
for (let dec of decoding) {
    try {
        let decodedText = new TextDecoder(dec).decode(bufferView);
        let loss = 0
        let recoverTable = {}
        let unrecoverable = 0
        for (let i = 0; i < decodedText.length; i++) {
            let charCode = decodedText.charCodeAt(i)
            if (charCode != i)
                loss++

            // Use "in" rather than a truthiness test: the stored index 0 is falsy.
            if (!(charCode in recoverTable))
                recoverTable[charCode] = i
            else
                unrecoverable++
        }
        let tableCnt = 0
        for (let props in recoverTable) {
            tableCnt++
        }
        if (tableCnt == 256 && unrecoverable == 0){
            recoverable.push(dec)
            setTimeout(()=>{
                console.log(`[${dec}] : err(${loss}/${decodedText.length}, ${Math.round(loss / decodedText.length * 100)}%) alive(${tableCnt}) unrecoverable(${unrecoverable})`)
            },10)
        }
        else {
            console.log(`!! [${dec}] : err(${loss}/${decodedText.length}, ${Math.round(loss / decodedText.length * 100)}%) alive(${tableCnt}) unrecoverable(${unrecoverable})`)
        }
    } catch (e) {
        console.log(`!! [${dec}] : not supported.`)
    }
}

setTimeout(()=>{
    console.log(`recoverable Charset : ${recoverable}`)
}, 10)

In my console, this prints:

recoverable Charset : ibm866,iso-8859-2,iso-8859-4,iso-8859-5,iso-8859-10,iso-8859-13,iso-8859-14,iso-8859-15,iso-8859-16,koi8-r,koi8-u,macintosh,windows-1250,windows-1251,windows-1252,windows-1254,windows-1256,windows-1258,x-mac-cyrillic,x-user-defined

I used iso-8859-15 in the code at the beginning of this answer because it needs the smallest recovery table.
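
As a sanity check on the table, the following sketch (an addition, not part of the original answer) decodes every possible byte value with TextDecoder('iso-8859-15') and maps the result back through the same eight-entry table. The names iso885915Table, recoverBytes and lossless are made up for illustration, and it assumes the runtime's TextDecoder supports the iso-8859-15 label (browsers do; a Node.js build without full ICU may not).

```javascript
// Same entries as the table at the top of this answer: decoded code point -> original byte.
const iso885915Table = { 338: 188, 339: 189, 352: 166, 353: 168, 376: 190, 381: 180, 382: 184, 8364: 164 };

// Map each UTF-16 code unit of the decoded string back to one byte.
function recoverBytes(str) {
    const out = new Uint8Array(str.length);
    for (let i = 0; i < str.length; i++) {
        const code = str.charCodeAt(i);
        const byte = Object.prototype.hasOwnProperty.call(iso885915Table, code)
            ? iso885915Table[code]
            : code;
        if (byte > 255) throw new RangeError(`unexpected code unit ${code} at index ${i}`);
        out[i] = byte;
    }
    return out;
}

// Every possible byte value, decoded the way the forced charset would decode it.
const allBytes = Uint8Array.from({ length: 256 }, (_, i) => i);
const decoded = new TextDecoder('iso-8859-15').decode(allBytes);
const recovered = recoverBytes(decoded);

// True only if each byte survives the decode-and-recover round trip unchanged.
const lossless = recovered.every((b, i) => b === allBytes[i]);
console.log(lossless);
```

If lossless is false in some environment, that decoder does not match the table and a different recoverable charset, with its own table, would be needed.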


Additional test: comparing the UTF-8 and ISO-8859-15 results

Check that U+FFFD really disappears when using ISO-8859-15.

function requestAjax(url, charset) {
    let req = new XMLHttpRequest();
    if (charset)
        req.overrideMimeType(`text/plain; charset=${charset}`);
    else
        charset = 'utf-8';
    req.open('get', url);
    req.onload = () => {
        console.log(`==========\n${charset}`)
        console.log(`${req.responseText.split('', 50)}\n==========`);
        console.log('\n')
    }
    req.send();
}

var url = '';
requestAjax(url, 'ISO-8859-15');
requestAjax(url);

Bottom line

  • Converting binary data to a string and back takes some extra work.
    • Find a recoverable text encoder/decoder.
    • Build a recovery table.
    • Recover the bytes with the table.
    • (See the code at the very top of this answer.)
  • To use this trick, force the MIME type of the incoming data to the desired charset.
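
Since the question asks for an ArrayBuffer and compares base64 strings, here is a small sketch of that last step, starting from a recovered Uint8Array. The helper name bytesToBase64 and the sample bytes are assumptions for illustration, and it relies on the global btoa (available in browsers and in recent Node.js versions).

```javascript
// Convert bytes to base64 in chunks so large inputs do not exceed the argument limit of apply().
function bytesToBase64(bytes) {
    let binary = '';
    const chunkSize = 0x8000;
    for (let i = 0; i < bytes.length; i += chunkSize) {
        binary += String.fromCharCode.apply(null, bytes.subarray(i, i + chunkSize));
    }
    // Every character is in the range 0-255 here, so btoa does not throw.
    return btoa(binary);
}

// Hypothetical sample standing in for the recovered bytes (a ZIP local-file signature).
const sample = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);

// The ArrayBuffer is simply the buffer behind the Uint8Array.
const arrayBuffer = sample.buffer;

console.log(arrayBuffer.byteLength);
console.log(bytesToBase64(sample));
```

The base64 of the recovered bytes should then start with the same characters as the incoming base64, which is an easy way to confirm that nothing was lost.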
mgcation