Can anyone help me figure out a surrogate-pair problem?
The source binary is #{EDA0BDEDB883} (encoded by Hessian/Java). How do I decode it to 😃 or "^(1F603)"?
I have checked the UTF-16 wiki page, but it only told me half the story.
My problem is how to convert #{EDA0BDEDB883} to \ud83d and \ude03.
My aim is to rewrite the Python 3 program below in Rebol or Red, just parsing the binary data, without any libraries.
This is how Python 3 does it:
def _decode_surrogate_pair(c1, c2):
    """
    Python 3 no longer decodes surrogate pairs for us; we have to do it
    ourselves.
    """
    # print(c1.encode('utf-8')) # \ud83d
    # print(c2.encode('utf-8')) # \ude03
    if not('\uD800' <= c1 <= '\uDBFF') or not ('\uDC00' <= c2 <= '\uDFFF'):
        raise Exception("Invalid UTF-16 surrogate pair")
    code = 0x10000
    code += (ord(c1) & 0x03FF) << 10
    code += (ord(c2) & 0x03FF)
    return chr(code)

def _decode_byte_array(bytes):
    s = ''
    while(len(bytes)):
        b, bytes = bytes[0], bytes[1:]
        c = b.decode('utf-8', 'surrogatepass')
        if '\uD800' <= c <= '\uDBFF':
            b, bytes = bytes[0], bytes[1:]
            c2 = b.decode('utf-8', 'surrogatepass')
            c = _decode_surrogate_pair(c, c2)
        s += c
    return s

bytes = [b'\xed\xa0\xbd', b'\xed\xb8\x83']
print(_decode_byte_array(bytes))
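
To make the byte-level step explicit, here is a minimal sketch that uses only shifts and masks, so it should port to Rebol or Red fairly directly. It assumes the input is exactly one pair encoded as two 3-byte UTF-8-style sequences (the CESU-8 layout that #{EDA0BDEDB883} appears to follow). The name `decode_cesu8_pair` is made up for illustration and is not part of Hessian or any library.

```python
def decode_cesu8_pair(data):
    """Decode 6 bytes (two 3-byte sequences) into (high, low, code point)."""
    if len(data) != 6:
        raise ValueError("expected exactly 6 bytes")

    def unit(b0, b1, b2):
        # A 3-byte sequence has the layout 1110xxxx 10yyyyyy 10zzzzzz.
        # The 16-bit value is the payload bits xxxx yyyyyy zzzzzz.
        if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
            raise ValueError("not a 3-byte UTF-8 sequence")
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

    hi = unit(data[0], data[1], data[2])
    lo = unit(data[3], data[4], data[5])
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("invalid UTF-16 surrogate pair")

    # Same arithmetic as _decode_surrogate_pair: 10 bits from each half.
    code = 0x10000 + ((hi & 0x03FF) << 10) + (lo & 0x03FF)
    return hi, lo, code


hi, lo, code = decode_cesu8_pair(bytes.fromhex("EDA0BDEDB883"))
print(hex(hi), hex(lo), hex(code))  # 0xd83d 0xde03 0x1f603
```

Worked by hand for the first half: ED & 0F = D, A0 & 3F = 20, BD & 3F = 3D, so (D << 12) | (20 << 6) | 3D = D83D. The second half gives DE03 the same way.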
public static void main(String[] args) throws Exception {
    // "😃"
    // "\uD83D\uDE03"
    final byte[] bytes1 = "😃".getBytes(StandardCharsets.UTF_16);
    // [-2, -1, -40, 61, -34, 3]
    // #{FFFFFFFE} #{FFFFFFFF} #{FFFFFFD8} #{0000003D} #{FFFFFFDE} #{00000003}
    System.out.println(Arrays.toString(bytes1));
}
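
Java bytes are signed, which is why the array prints negative numbers and why converting each value to binary gives FFFFFFFE and so on. As a rough sketch (the helper name `signed_to_bytes` is hypothetical), masking each value with 0xFF recovers the unsigned bytes, which here are a big-endian BOM followed by the two surrogate code units:

```python
def signed_to_bytes(values):
    """Map Java-style signed bytes (-128..127) to unsigned bytes (0..255)."""
    return bytes(v & 0xFF for v in values)


raw = signed_to_bytes([-2, -1, -40, 61, -34, 3])
print(raw.hex())                      # feffd83dde03
print(hex(ord(raw.decode("utf-16"))))  # 0x1f603
```

So after the FE FF byte order mark, the payload is D83D DE03, the same two surrogates as in the question.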