How to parse non-UTF8 XML in browsers with Javascript?

Question

I have an XML string encoded in big5:

atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+')

(<?xml version="1.0" encoding="big5" ?><title>中文</title> in UTF-8.)

I'd like to extract the content of <title>. How can I do that with pure JavaScript in browsers? Lightweight solutions without jQuery or Emscripten are preferred.

I have tried DOMParser:

(new DOMParser()).parseFromString(atob('PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+'), 'text/xml')

But neither Chromium nor Firefox respects the encoding attribute. Is it standard behavior that DOMParser supports only UTF-8?

Maybe a silly question that exposes my ignorance, but how are you checking that the encoding attribute is not respected? — Michal Charemza, Jul 20 '16 at 18:44
Also, in your real case, is the string encoded as big5, and then base64, as in your example here? — Michal Charemza, Jul 20 '16 at 20:04
As a reference for future visitors, the real code is here: https://github.com/yan12125/chrome_newtab/blob/c2336374c74cce438c956812b7639ed74ede619f/content/newtab.js#L70-L77. This is an old commit of my project, which now uses the TextDecoder approach mentioned below. – Chih-Hsuan Yen, Jul 26 '16 at 03:39

Michal Charemza · Accepted Answer · 2016-07-21T05:06:15.633

5

I suspect the issue isn't DOMParser but atob, which can't properly decode what was originally a non-ASCII string.*

You will need to use another method to get at the original bytes, such as using https://github.com/danguer/blog-examples/blob/master/js/base64-binary.js

var encoded = 'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
var bytes = Base64Binary.decode(encoded);
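
If pulling in a library is undesirable, the same bytes can probably be recovered from atob's return value, since each character of that string carries one byte in its code unit. A minimal sketch, where the helper name base64ToBytes is made up for illustration and is not part of the linked library:

```javascript
// Hypothetical helper (an assumption, not from the answer): base64 -> raw bytes.
function base64ToBytes(base64) {
  // atob returns a "binary string": one character per byte, code unit 0-255.
  var binary = atob(base64);
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

var encoded = 'PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iYmlnNSIgPz48dGl0bGU+pKSk5TwvdGl0bGU+';
var bytes = base64ToBytes(encoded); // Uint8Array holding the big5-encoded XML
```

The resulting Uint8Array can be passed to TextDecoder in the same way as the output of Base64Binary.decode.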

and then some method to convert the bytes (i.e. decode the big5 data) into a JavaScript string. In Firefox and Chrome, you can use TextDecoder:

var decoder = new TextDecoder('big5');
var decoded = decoder.decode(bytes);

And then pass to DOMParser

var dom = (new DOMParser()).parseFromString(decoded, 'text/xml');
var title = dom.documentElement.textContent; // <title> is the root element here

You can see this at https://plnkr.co/edit/TBspXlF2vNbNaKq8UxhW?p=preview
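
When the encoding is not known in advance, one option is to read it from the XML declaration before decoding. The following is only a sketch under assumptions: the names sniffXmlEncoding and decodeXmlBytes are hypothetical, and the approach only works for ASCII-compatible encodings such as big5 (it does not handle UTF-16 or byte order marks).

```javascript
// Hypothetical helper: read the encoding label from the XML declaration.
// Assumes an ASCII-compatible encoding (no UTF-16/UTF-32, no BOM handling).
function sniffXmlEncoding(bytes) {
  // Map the first bytes 1:1 to characters so the ASCII declaration is readable.
  var head = String.fromCharCode.apply(null, bytes.subarray(0, 200));
  var match = /^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
  // Fall back to UTF-8 when there is no encoding declaration.
  return match ? match[1] : 'utf-8';
}

// Decode the bytes with the sniffed label. TextDecoder throws a RangeError
// if the label is not one it supports.
function decodeXmlBytes(bytes) {
  return new TextDecoder(sniffXmlEncoding(bytes)).decode(bytes);
}
```

In a browser, the decoded string would then go to DOMParser as above, for example new DOMParser().parseFromString(decodeXmlBytes(bytes), 'text/xml').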

*One way of understanding why: atob doesn't take the encoding of the original string as a parameter. It decodes the base64 data to bytes internally, but to return a JavaScript string it has to map those bytes to characters somehow. It does so by turning each byte into the character with the same code point (effectively treating the data as Latin-1), so the two-byte big5 sequences come out as unrelated characters. The JavaScript string itself is, I believe, internally encoded as UTF-16.
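
To make the footnote concrete, here is a small sketch that uses only the four bytes of the title text. The base64 string 'pKSk5Q==' is derived from the question's input and is not in the original post. It shows that atob returns one character per byte rather than the two big5 characters.

```javascript
// 'pKSk5Q==' is the base64 form of the four big5 bytes A4 A4 A4 E5 ("中文").
var binary = atob('pKSk5Q==');

// One character per byte: the code units are the raw byte values.
var codes = [];
for (var i = 0; i < binary.length; i++) {
  codes.push(binary.charCodeAt(i).toString(16));
}

console.log(binary.length);   // 4, not 2
console.log(codes.join(' ')); // a4 a4 a4 e5
```

Passing such a string to DOMParser therefore parses four unrelated Latin-1 characters, which is consistent with the encoding attribute appearing to be ignored.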

edited Jul 21 '16 at 05:06

answered Jul 20 '16 at 20:45

Michal Charemza


Thanks for that. TextEncoder/TextDecoder is indeed what I used later. atob is problematic, as well as DOMParser. In a bug report at https://bugzilla.mozilla.org/show_bug.cgi?id=1287071, a Mozilla developer has confirmed that DOMParser assumes all inputs to be UTF-8. In fact from dom/base/DOMParser.cpp of mozilla-central, it's easy to see that parseFromString uses a hard-coded encoding UTF-8. The TextDecoder approach requires knowing the encoding a priori. It's less than ideal but sufficient for my project. – Chih-Hsuan Yen Jul 26 '16 at 03:45

Just for reference I think it converts from UTF-16 to UTF-8 internally https://github.com/mozilla/gecko-dev/blob/master/dom/base/DOMParser.cpp#L116 . Not sure that makes a difference to your situation, admittedly. – Michal Charemza Jul 26 '16 at 05:35
Thanks for that. Seems all Javascript strings are assumed to be UTF-16 on the C level? – Chih-Hsuan Yen Jul 27 '16 at 03:41
I believe so. (Although slightly strange to say "assumed"... they *are* UTF-16). – Michal Charemza Jul 27 '16 at 03:45

milahu · Answer 2 · 2022-12-24T10:45:31.540

Related: parsing a document from non-UTF-8 HTML

/**
 * Parse an HTML document from an HTTP response,
 * including responses that are not UTF-8 encoded.
 *
 * Use this instead of
 * ```
 * const html = await response.text()
 * const doc = new DOMParser().parseFromString(html, "text/html");
 * ```
 *
 * @param {Response} response
 * @return {Promise<Document>}
 */
async function documentOfResponse(response) {
  // example content-type: text/html; charset=ISO-8859-1
  // the header may be missing, so fall back to an empty string
  const contentType = response.headers.get("content-type") || ""
  const type = contentType.split(";")[0].trim().toLowerCase() || "text/html"
  // capture only the charset value, without quotes or trailing parameters
  const charset = (contentType.match(/;\s*charset="?([^";\s]+)/i) || [])[1]
  let html = ""
  if (charset && !/^utf-?8$/i.test(charset)) {
    // TextDecoder throws a RangeError for labels it does not recognize
    const decoder = new TextDecoder(charset)
    const buffer = await response.arrayBuffer()
    html = decoder.decode(buffer) // bytes -> JavaScript string
  }
  else {
    html = await response.text() // response.text() always decodes as UTF-8
  }
  return new DOMParser().parseFromString(html, type)
}

// demo
const response = await fetch("https://github.com/")
const doc = await documentOfResponse(response)
const title = doc.querySelector("title")
console.log(title)
