
Can anyone help me figure out a surrogate pair problem?

The source binary is #{EDA0BDEDB883} (encoded by Hessian/Java). How do I decode it to 😃 or "^(1F603)"? I have checked the UTF-16 wiki page, but it only told me half the story.

Specifically, how do I convert #{EDA0BDEDB883} to \ud83d and \ude03?

My aim is to rewrite the Python 3 program below in Rebol or Red, just parsing the binary data, without any libraries.

This is how Python 3 does it:

def _decode_surrogate_pair(c1, c2):
    """
    Python 3 no longer decodes surrogate pairs for us; we have to do it
    ourselves.
    """
    # For the sample input, c1 == '\ud83d' and c2 == '\ude03'.
    if not('\uD800' <= c1 <= '\uDBFF') or not ('\uDC00' <= c2 <= '\uDFFF'):
        raise Exception("Invalid UTF-16 surrogate pair")
    code = 0x10000
    code += (ord(c1) & 0x03FF) << 10
    code += (ord(c2) & 0x03FF)
    return chr(code)

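def _decode_byte_array(chunks):
    # Each element of chunks holds the UTF-8 bytes of one UTF-16 code unit.
    s = ''
    while chunks:
        b, chunks = chunks[0], chunks[1:]
        # 'surrogatepass' lets the codec decode a lone surrogate such as b'\xed\xa0\xbd'.
        c = b.decode('utf-8', 'surrogatepass')
        if '\uD800' <= c <= '\uDBFF':
            # High surrogate: the next element must be the matching low surrogate.
            b, chunks = chunks[0], chunks[1:]
            c2 = b.decode('utf-8', 'surrogatepass')
            c = _decode_surrogate_pair(c, c2)
        s += c
    return s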

bytes = [b'\xed\xa0\xbd', b'\xed\xb8\x83']

print(_decode_byte_array(bytes))

public static void main(String[] args) throws Exception {
    // "😃"
    // "\uD83D\uDE03"
    final byte[] bytes1 = "\uD83D\uDE03".getBytes(StandardCharsets.UTF_16);
    // [-2, -1, -40, 61, -34, 3]
    // #{FFFFFFFE}  #{FFFFFFFF} #{FFFFFFD8} #{0000003D}  #{FFFFFFDE}  #{00000003}
    System.out.println(Arrays.toString(bytes1));
}
tianwen
  • What have you tried *in Java* and how did it fail? – Sweeper Jan 15 '23 at 04:22
  • @Sweeper I just added some Java code to the post. I don't know where `#{EDA0BDEDB883}` comes from. – tianwen Jan 15 '23 at 04:54
  • `#{EDA0BD}` and `#{EDB883}` are the UTF-8 forms of `\ud83d` and `\ude03`, respectively. Now I see where I made a mistake. – tianwen Jan 15 '23 at 05:09
  • Apply [The WTF-8 encoding](https://simonsapin.github.io/wtf-8/) - decode from potentially ill-formed `UTF-16` to code points and vice versa… – JosefZ Jan 15 '23 at 09:59
  • Java uses **Modified UTF-8**, see [What does it mean to say "Java Modified UTF-8 Encoding"?](https://stackoverflow.com/questions/7921016/) – Remy Lebeau Jan 18 '23 at 20:08
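
Putting the comments above together: each surrogate is stored as its own 3-byte UTF-8 sequence, so the whole decode can be done with bit masks and shifts only, which is the part that should carry over to Rebol or Red. Below is a minimal sketch in Python, assuming the input is exactly one surrogate pair encoded as two 3-byte sequences, as in `#{EDA0BDEDB883}`. The helper names `decode_three_byte` and `combine_surrogates` are made up for illustration and do not come from the post or from any library.

```python
def decode_three_byte(b0, b1, b2):
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
    into a 16-bit value. Hypothetical helper, for illustration only."""
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # Keep 4 payload bits from the lead byte and 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("invalid UTF-16 surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


data = bytes.fromhex("EDA0BDEDB883")
high = decode_three_byte(data[0], data[1], data[2])
low = decode_three_byte(data[3], data[4], data[5])
code_point = combine_surrogates(high, low)

print(hex(high), hex(low), hex(code_point))  # prints 0xd83d 0xde03 0x1f603
```

A full decoder would also have to handle 1-byte and 2-byte sequences and ordinary 3-byte characters outside the surrogate range; this sketch only covers the surrogate-pair case from the question.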
