How to convert a binary in surrogate pairs to unicode?

Question

Anyone can help figure out a surrogate pairs problem?

The source binary is #{EDA0BDEDB883}（encoded by Hessian/Java）, how to decode it to or "^(1F603)"? I have checked UTF-16 wiki, but it only telled me a half story.

My problem is how to convert #{EDA0BDEDB883} to \ud83d and \ude03?

My aim is to rewrite the python3 program to Rebol or Red-lang，just parsing binary data, without any libs.

This is how python3 do it:

def _decode_surrogate_pair(c1, c2):
    """
    Python 3 no longer decodes surrogate pairs for us; we have to do it
    ourselves.
    """
    # print(c1.encode('utf-8')) # \ud83d
    # print(c2.encode('utf-8')) # \ude03
    if not('\uD800' <= c1 <= '\uDBFF') or not ('\uDC00' <= c2 <= '\uDFFF'):
        raise Exception("Invalid UTF-16 surrogate pair")
    code = 0x10000
    code += (ord(c1) & 0x03FF) << 10
    code += (ord(c2) & 0x03FF)
    return chr(code)

def _decode_byte_array(bytes):
    s = ''
    while(len(bytes)):
        b, bytes = bytes[0], bytes[1:]
   
        c = b.decode('utf-8', 'surrogatepass')
        if '\uD800' <= c <= '\uDBFF':
            b, bytes = bytes[0], bytes[1:]
            c2 = b.decode('utf-8', 'surrogatepass')
            c = _decode_surrogate_pair(c, c2)
        s += c
    return s

bytes = [b'\xed\xa0\xbd', b'\xed\xb8\x83']

print(_decode_byte_array(bytes))

public static void main(String[] args) throws Exception {
    // ""
    // "\uD83D\uDE03"
    final byte[] bytes1 = "".getBytes(StandardCharsets.UTF_16);
    // [-2, -1, -40, 61, -34, 3]
    // #{FFFFFFFE}  #{FFFFFFFF} #{FFFFFFD8} #{0000003D}  #{FFFFFFDE}  #{00000003}
    System.out.println(Arrays.toString(bytes1));
}

@Sweeper Just add some java code in the post. don't know where `#{EDA0BDEDB883}` come from. — tianwen, Jan 15 '23 at 04:54
#{EDA0BD} and #{EDB883} is the utf-8 form of `\ud83d` and `\ude03` separately. I know where I made a mistake. — tianwen, Jan 15 '23 at 05:09
Apply [The WTF-8 encoding](https://simonsapin.github.io/wtf-8/) - decode from potentially ill-formed `UTF-16` to code points and vice versa… — JosefZ, Jan 15 '23 at 09:59
Java uses **Modified UTF-8**, see [What does it mean to say "Java Modified UTF-8 Encoding"?](https://stackoverflow.com/questions/7921016/) — Remy Lebeau, Jan 18 '23 at 20:08

How to convert a binary in surrogate pairs to unicode?

0 Answers0