Questions tagged [cesu-8]

CESU-8 is a non-standard Unicode character encoding that is close to UTF-8, except that code points above U+FFFF are represented as UTF-16 surrogate pairs, with each 16-bit surrogate encoded as its own 3-byte sequence. Standard UTF-8 encodes those code points directly as 4-byte sequences, without using surrogate pairs as an intermediate encoding.

CESU-8 as described by http://fileformats.archiveteam.org/:

CESU-8 is an inefficient Unicode character encoding related to UTF-8. It is not an accepted standard, but has been documented in the interest of practicality. It's what you get if you take UTF-16 data, reinterpret it as UCS-2, then convert it to UTF-8 (while ignoring any rules forbidding the use of code points in the range U+D800 to U+DFFF). A code point thus uses 1, 2, 3, or 6 bytes. It is sometimes used by accident, but may be used deliberately to accommodate systems that don't support 4-byte UTF-8 sequences, or when a close correspondence between UTF-16 and a UTF-8-like encoding is deemed necessary.
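The conversion described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the helper names `cesu8_encode` and `cesu8_decode` are hypothetical, and the decoder assumes the input contains only properly paired surrogates. It relies on Python's built-in `surrogatepass` error handler, which lets the UTF-8 codec read and write code points in the range U+D800 to U+DFFF.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split the supplementary code point into a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            # Encode each surrogate as a separate 3-byte UTF-8-style sequence.
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            # BMP code points are encoded exactly as in UTF-8 (1 to 3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes; assumes every surrogate is correctly paired."""
    # Step 1: decode while tolerating the encoded surrogates.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: round-trip through UTF-16 so each pair is merged
    # into a single code point.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")


emoji = "\U0001F600"                       # a code point above U+FFFF
print(len(emoji.encode("utf-8")))          # 4 bytes in standard UTF-8
print(len(cesu8_encode(emoji)))            # 6 bytes in CESU-8
print(cesu8_decode(cesu8_encode(emoji)) == emoji)
```

Note that the decoder sketched here also accepts ordinary 4-byte UTF-8 sequences, so it does not validate that the input is strictly CESU-8; an unpaired surrogate causes the final UTF-16 decode to raise `UnicodeDecodeError`.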

2 questions
6 votes, 1 answer

Unable to decode/encode correctly from bytes in Python 3.7.3

I'm struggling with this: b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf…
Folkvir
2 votes, 4 answers

Convert CESU-8 to UTF-8 with high performance

I have some raw text that is usually a valid UTF-8 string. However, every now and then it turns out that the input is in fact a CESU-8 string, instead. It is possible to technically detect this and convert to UTF-8 but as this happens rarely, I…
Mikko Rantalainen