Questions tagged [cesu-8]

CESU-8 is a non-standard Unicode character encoding close to UTF-8, except that code points above U+FFFF are first split into a UTF-16 surrogate pair and each 16-bit surrogate is then encoded as its own 3-byte sequence. Standard UTF-8 encodes those code points directly as 4-byte sequences, without surrogate pairs as an intermediate step.

CESU-8 as defined by http://fileformats.archiveteam.org/:

CESU-8 is an inefficient Unicode character encoding related to UTF-8. It is not an accepted standard, but has been documented in the interest of practicality. It's what you get if you take UTF-16 data, reinterpret it as UCS-2, then convert it to UTF-8 (while ignoring any rules forbidding the use of code points in the range U+D800 to U+DFFF). A code point thus uses 1, 2, 3, or 6 bytes. It is sometimes used by accident, but may be used deliberately to accommodate systems that don't support 4-byte UTF-8 sequences, or when a close correspondence between UTF-16 and a UTF-8-like encoding is deemed necessary.
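
The definition above can be turned into a short Python sketch. It is only an illustration, not a standard library API: the function names `cesu8_encode` and `cesu8_decode` are hypothetical, and the approach assumes Python 3.4 or later, where the `surrogatepass` error handler works with both the `utf-8` and `utf-16-be` codecs. The decoder is lenient and also accepts ordinary 4-byte UTF-8 sequences, so it does not validate that the input is strictly CESU-8.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: UTF-16 code units, each written as UTF-8."""
    # Step 1: take the UTF-16 representation (big-endian, no BOM).
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    # Step 2: treat every 16-bit code unit as if it were a UCS-2 character
    # and encode it as UTF-8, allowing surrogates (U+D800 to U+DFFF).
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes into a normal Python string."""
    # Step 1: decode as UTF-8 while letting encoded surrogates through.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each surrogate pair is merged
    # back into one code point; an unpaired surrogate raises an error here.
    return with_surrogates.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


if __name__ == "__main__":
    # A surrogate-pair pattern like the ones in the first question below.
    sample = b"\xed\xa0\x81\xed\xb1\x96"
    decoded = cesu8_decode(sample)
    print(hex(ord(decoded)))        # code point recovered from the pair
    print(decoded.encode("utf-8"))  # the same character as standard UTF-8
```

Round-tripping through UTF-16 keeps the sketch short because the codec machinery does the surrogate arithmetic. For large inputs that are almost always valid UTF-8, it may be cheaper to check first whether the byte `0xED` appears at all before attempting any conversion.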

2 questions

1 answer

Unable to decode/encode correctly from bytes in Python 3.7.3

I'm struggling with this: b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf…

asked Oct 19 '19 at 14:48

Folkvir

4 answers

Convert CESU-8 to UTF-8 with high performance

I have some raw text that is usually a valid UTF-8 string. However, every now and then it turns out that the input is in fact a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but as this happens rarely, I…

php performance unicode utf-8 cesu-8

asked Dec 08 '15 at 08:26

Mikko Rantalainen