
The problem: I'm given a sequence of bytes (say as a Uint8Array) which I'd like to interpret as a utf8-encoded string. That is, I'd like to decode the bytes into a valid unicode string.

However, it's possible that the bytes won't be valid utf8. If that's the case, I'd like to make a "best effort" attempt to decode the string anyway.

In Python I can do the following:

>>> import codecs
>>> codecs.register_error('replace_?', lambda e: (u'?', e.start + 1))
>>> uint8array = map(ord, 'some mostly ok\x80string')
>>> uint8array
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
>>> ''.join(map(chr, uint8array)).decode('utf8', 'replace_?')
u'some mostly ok?string'

In JavaScript, I've learned the decoding would go as follows:

> uint8array = new Uint8Array([115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103])
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
> decodeURIComponent(escape(String.fromCharCode.apply(null, uint8array)))
Uncaught URIError: URI malformed

As you can see, this raises an exception, much like the Python code would if I didn't register my custom error handler.
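
For what it's worth, the failure seems to come from the round-trip itself: escape() turns the stray 0x80 byte into a lone %80 escape, which decodeURIComponent() then rejects because it isn't a valid utf8 sequence on its own:

> escape(String.fromCharCode(0x80))
"%80"
> decodeURIComponent('%80')
Uncaught URIError: URI malformed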

How would I go about getting the same behavior as the Python snippet - replacing the malformed utf8 bytes with '?' instead of choking on the whole string?
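
To make the desired behaviour concrete, here's a sketch of what I'm after, written against the TextDecoder API (I'm not sure how widely it's supported in the environments I need to target, which is part of why I'm asking). Its default, non-fatal mode replaces malformed sequences with U+FFFD, which can then be mapped to '?'; the caveat is that any genuine U+FFFD already present in the input would be replaced as well:

> decoded = new TextDecoder('utf-8').decode(uint8array)
"some mostly ok�string"
> decoded.replace(/\uFFFD/g, '?')
"some mostly ok?string"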

Claudiu
  • If all you want to do is interpret the bytes as a UTF-8 string, what do the `escape()` and `decodeURIComponent()` functions have to do with anything? – Pointy Apr 19 '16 at 14:50
  • If you just pass those values to `.fromCharCode()` you get a string, which should be clear by the fact that the error you're getting is coming from one of those URI-related functions. – Pointy Apr 19 '16 at 14:52
  • @Pointy: That's just the weird way in which you decode a utf8-encoded string in JavaScript, see [this answer](http://stackoverflow.com/a/13691499/15055) for example. Just applying `fromCharCode` to the bytes won't get me a unicode string, [see here](https://jsfiddle.net/mzwuu4th/). – Claudiu Apr 19 '16 at 15:03
  • *All* strings in JavaScript are unicode; JavaScript represents strings internally as UTF-16. – Pointy Apr 19 '16 at 15:04
  • @Pointy: Yes. I'm given **bytes** which represent a utf8-encoding of a unicode string. I want to decode these **bytes** into a unicode string. `decodeURIComponent(escape(String.fromCharCode.apply(null, uint8array)))` is how you do that. If I start from the bytes `[0xe2, 0x82, 0xac]`, the correct result is `€`, not `â¬`. – Claudiu Apr 19 '16 at 15:08
  • Well that technique looks like a terrible hack to me, and one reason for that is the very problem you're encountering. There are straightforward ways of interpreting UTF encodings, and I personally would not hesitate to implement that to get something that would perform perfectly well in addition to being flexible for handling encoding errors. – Pointy Apr 19 '16 at 15:11
  • Aye, a terrible hack it is! It's just what I came across when googling "decode utf8 javascript". I was wondering what the best way to do this properly would be, hence my question. I was hoping I wouldn't have to implement the utf8 spec myself. – Claudiu Apr 19 '16 at 15:13
  • There seem to be many projects on GitHub for this purpose. I think basically it's a matter of reading the UTF-8 bytes and assembling UTF-32 values, then going from there back to UTF-16 for `.fromCharCode()`. [Here is one example.](https://github.com/nfroidure/UTF8.js/blob/master/src/UTF8.js) – Pointy Apr 19 '16 at 15:17
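
Following up on that last comment, here is a rough sketch of the manual approach: walk the bytes, assemble code points, and emit '?' for anything malformed. It is deliberately simplified (it does not reject overlong encodings or utf8-encoded surrogates the way a strict decoder would), so treat it as a starting point rather than a full implementation of the spec:

function decodeUtf8Lenient(bytes) {
  var out = '';
  var i = 0;
  while (i < bytes.length) {
    var b = bytes[i];
    var cp, extra;
    if (b < 0x80)                 { cp = b;        extra = 0; }  // ASCII
    else if ((b & 0xe0) === 0xc0) { cp = b & 0x1f; extra = 1; }  // 2-byte sequence
    else if ((b & 0xf0) === 0xe0) { cp = b & 0x0f; extra = 2; }  // 3-byte sequence
    else if ((b & 0xf8) === 0xf0) { cp = b & 0x07; extra = 3; }  // 4-byte sequence
    else { out += '?'; i += 1; continue; }                       // invalid lead byte

    // Consume the expected continuation bytes (10xxxxxx).
    var ok = true;
    for (var j = 1; j <= extra; j++) {
      var c = bytes[i + j];
      if (c === undefined || (c & 0xc0) !== 0x80) { ok = false; break; }
      cp = (cp << 6) | (c & 0x3f);
    }

    if (!ok || cp > 0x10ffff) {
      out += '?';   // truncated or invalid sequence: replace the lead byte and resync
      i += 1;
      continue;
    }

    // fromCodePoint handles the UTF-16 surrogate pairs (ES2015; older engines
    // would need fromCharCode plus manual surrogate-pair math).
    out += String.fromCodePoint(cp);
    i += extra + 1;
  }
  return out;
}

// decodeUtf8Lenient(new Uint8Array([0xe2, 0x82, 0xac]))   -> "€"
// decodeUtf8Lenient(new Uint8Array([111, 107, 128, 33]))  -> "ok?!"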

0 Answers