Unable to decode/encode correctly from bytes in Python 3.7.3

Question

I'm struggling with this:

b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'

It comes from a binary format, the HDT-compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet), and it is decoded correctly as UTF-8 by this website (https://mothereff.in/utf-8).

And the meaning is: "·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en

But in Python 3.7.3 I get the well-known error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte` when calling `mystring.decode('utf8')`.

If I do the opposite, '"·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en'.encode('utf8'), I get the following representation: b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en' which is not the same byte string, but repr.decode('utf8') correctly decodes it back into the same text.

Can someone help me understand why decoding the first byte string does not work? I know from the error that the first byte string is not valid UTF-8. But then why is it decoded fine by the website I linked, but not by Python? Thank you in advance!

FINAL EDIT: After accepting the answer, I did some more research and found that this string was encoded with the CESU-8 codec, which is deprecated today but still in use in some places. I found a package that provides a variant of the UTF-8 codec which can decode this string: python-ftfy (https://github.com/LuminosoInsight/python-ftfy). The added codec is 'utf-8-variants'. I hope this helps others with the same problem.

Are you asking `why does it work on that website?` or are you asking `how do I convert this bytestring to 'xyz...'?` — wwii, Oct 19 '19 at 15:14
Both, I think. How do I convert this bytestring, and how does it work on the website? — Folkvir, Oct 19 '19 at 15:16
Related: [UnicodeDecodeError, invalid continuation byte](https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte) — wwii, Oct 19 '19 at 15:56
@wwii In this case, it is not really an invalid continuation byte, but a problem on the next level, as I explained in my answer. — zvone, Oct 19 '19 at 15:58

score 7 · Accepted Answer · edited Oct 07 '21 at 08:14

It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.

The first two bytes (b'\xc2\xb7') are accepted by Python. The first thing which Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which is interpreted on that website as 𐑖 (U+10456).

Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format: 11101101 10100000 10000001 11101101 10110001 10010110

RFC3629 says that UTF-8 is interpreted as:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Therefore, there are two three-byte characters:

11101101 10100000 10000001 ⇒ 1101100000000001, or D801

11101101 10110001 10010110 ⇒ 1101110001010110, or DC56
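The bit extraction above can be checked with a few lines of Python. This is only a sketch: `decode_three_byte` is a hypothetical helper name, and it handles nothing but the three-byte row of the RFC 3629 table.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    b0, b1, b2 = seq
    # Verify the fixed marker bits of the three-byte pattern.
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 pattern")
    # Keep 4 + 6 + 6 payload bits and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = decode_three_byte(b"\xed\xa0\x81")
low = decode_three_byte(b"\xed\xb1\x96")
print(f"{high:04X} {low:04X}")  # D801 DC56
```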

Character D801 is one of the high surrogates and DC56 is one of the low surrogates.

You can see here how to combine the surrogates:

A surrogate pair denotes the code point 0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00) where H and L are the numeric values of the high and low surrogates respectively.

If you combine them, you'll get:

0x10000 + (0xD801 - 0xD800) * 0x400 + (0xDC56 - 0xDC00) = 0x10456, which is U+10456 (𐑖), a letter of the Shavian alphabet.
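The same arithmetic as a small sketch; `combine_surrogates` is a hypothetical name used only for illustration.

```python
import unicodedata


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


code_point = combine_surrogates(0xD801, 0xDC56)
print(hex(code_point))  # 0x10456
# Look up the character's official name in the Unicode database.
print(unicodedata.name(chr(code_point)))
```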

However, the high and low surrogates were designed for UTF-16 representation of characters which do not fit into 16 bits, and this is what unicode.org says about using such surrogate pairs in UTF-8:

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatability Encoding Scheme for UTF-16: 8-bit (CESU) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]

The key point here is "Such an encoding is not conformant to UTF-8 as defined". So, your input is in fact an invalid UTF-8 sequence, and Python rejected it as such.
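To make the difference concrete, here is a rough sketch that encodes the same character both ways. `encode_cesu8` is a hypothetical helper written for this illustration (Python ships no CESU-8 codec); it relies on the standard `surrogatepass` error handler to emit each surrogate as its own three-byte sequence.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text the CESU-8 way: supplementary characters become two
    three-byte sequences (one per UTF-16 surrogate)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a UTF-16 surrogate pair first.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010456"
print(ch.encode("utf-8"))  # b'\xf0\x90\x91\x96'
print(encode_cesu8(ch))    # b'\xed\xa0\x81\xed\xb1\x96'
```

The first result is the single 4-byte sequence that UTF-8 requires; the second is the pair of 3-byte sequences found in the question's input.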

To answer the question:

- https://mothereff.in/utf-8 is ignoring unicode.org's instruction to treat this as invalid.
- Python is treating this as invalid.
- If you want to decode it, even though it is invalid, you can write a function which does what I did manually.
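One possible shortcut, instead of redoing the bit arithmetic by hand, is to lean on the standard codecs. This is a minimal sketch, not a full CESU-8 implementation: `decode_cesu8` is a hypothetical name, and the approach assumes every surrogate in the input is properly paired (an unpaired one makes the last step raise `UnicodeDecodeError`).

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8-style bytes by merging surrogate pairs via UTF-16."""
    # Step 1: 'surrogatepass' lets the three-byte surrogate sequences
    # through, producing a str that contains lone surrogates.
    with_surrogates = data.decode("utf-8", "surrogatepass")
    # Step 2: write those surrogates out as UTF-16 code units.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    # Step 3: a normal UTF-16 decode joins each high/low pair.
    return utf16.decode("utf-16-le")


# Fragment of the byte string from the question (first letter only).
raw = b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96"@en'
text = decode_cesu8(raw)
print(text.encode("utf-8"))  # b'"\xc2\xb7\xf0\x90\x91\x96"@en'
```

Input that is already valid UTF-8 passes through unchanged, so the function can be applied to mixed data.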

Btw, do you know of a library that does this (to avoid reinventing the wheel)? If not, I'll create my own converter based on your explanation. — Folkvir, Oct 19 '19 at 17:04
@Folkvir I'm afraid I don't know of such a library. It might exist. It's up to you to decide whether to spend time looking for it or reinventing the wheel ;) — zvone, Oct 19 '19 at 21:14
