The problem: I'm given a sequence of bytes (say as a Uint8Array
) which I'd like to interpret as a utf8-encoded string. That is, I'd like to decode the bytes into a valid unicode string.
However, it is possible that the bytes will not be a valid utf8-encoding. If that's the case, I'd like to do a "best effort" attempt to decode the string anyway.
In Python I can do the following:
>>> import codecs
>>> codecs.register_error('replace_?', lambda e: (u'?', e.start + 1))
>>> uint8array = map(ord, 'some mostly ok\x80string')
>>> uint8array
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
>>> ''.join(map(chr, uint8array)).decode('utf8', 'replace_?')
u'some mostly ok?string'
In JavaScript, I've learned the decoding would go as follows:
> uint8array = new Uint8Array([115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103])
[115, 111, 109, 101, 32, 109, 111, 115, 116, 108, 121, 32, 111, 107, 128, 115, 116, 114, 105, 110, 103]
> decodeURIComponent(escape(String.fromCharCode.apply(null, uint8array)))
Uncaught URIError: URI malformed(…)
As you can see, this raises an exception, much like the Python code would if I didn't specify my custom codec handler.
How would I go about getting the same behavior as the Python snippet - replacing the malformed utf8 bytes with '?'
instead of choking on the whole string?