Python misinterprets 3 character string as UTF-8 continuation byte

Question

When saving a Pandas dataset to Excel, I ran into:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 0: invalid continuation byte

Some digging showed that I can put together 3 ASCII characters and the resulting string appears to start with a UTF-8 continuation byte. Obviously there are no multibyte characters in the string. What is the best way to overcome this so that all my data is interpreted as ASCII characters?

Here is Python code that demonstrates how the continuation byte manifests:

Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> string_from_3_ascii_chars = chr(50) + chr(51) + chr(48)
>>> print(string_from_3_ascii_chars)
230
>>> print(string_from_3_ascii_chars.startswith(str(0xe6)))
True
>>>

`str(0xe6)` is `'230'` (a string with three characters) – not the same as `chr(0xe6)`, which is `'\xe6'` or `'æ'` (a string with one character) – again not the same as the *byte* `0xe6`, like in the byte string `b'\xe6'`. — lenz, Aug 16 '19 at 22:19
@lenz yes, I want the string to be '230'. My question is how to make `startswith(str(0xe6))` return `False` — Oleg Zhylin, Aug 16 '19 at 23:22
`x.startswith('230')` is true if and only if `x` starts with the characters `'2'`, `'3'`, and `'0'`. But this is completely unrelated to the initially mentioned UnicodeDecodeError. And also, Python does *not* misinterpret a 3-character string as a continuation byte. — lenz, Aug 17 '19 at 07:43
My guess is that the error message is trying to say there are invalid continuation bytes _after_ the first \xE6 byte, otherwise it doesn't make sense. And like @lenz says, this has nothing to do with your experiments. — Mr Lister, Aug 17 '19 at 08:00
@MrLister The error message says that the UTF-8 decoder expected the next byte to be a continuation byte (binary `10XXXXXX`), but it encountered `E6` (`11100110`). It's rather peculiar that this happens on a save operation (you'd rather expect an *encoding* error, not a decoding problem), but it's not impossible. — lenz, Aug 17 '19 at 09:06
@lenz Yes, but the error also says "in position 0", which is where continuation bytes can't occur. — Mr Lister, Aug 17 '19 at 09:13
@MrLister oh, good point. In fact you are right: the error message can be reproduced with `b'\xe6a'.decode('utf8')` — lenz, Aug 17 '19 at 09:16
@MrLister indeed, continuation byte there doesn't make sense. Would be great to figure out how any kind of continuation bytes show up in a string made from 3 normal ASCII characters. — Oleg Zhylin, Aug 18 '19 at 04:29
@OlegZhylin It would be very helpful, if you could provide a minimal example, that reproduces the error, because (as others have mentioned before) your "Python code that demonstrates how continuation byte manifests" has absolutely nothing to do with continuation bytes or your error... — T S, Aug 18 '19 at 21:28
@TS I will post a separate question about `decode` while working with a DataFrame. In the scope of this question I am looking to find out why `print(string_from_3_ascii_chars.startswith(str(0xe6)))` prints `True`. — Oleg Zhylin, Aug 19 '19 at 06:11
@OlegZhylin because `str(0xe6)` and `chr(50) + chr(51) + chr(48)` are both the same way to construct the string `'230'`, which is a string with three ASCII characters. Why would you expect `'230'.startswith('230')` to return something else than `True`? — lenz, Aug 19 '19 at 16:45
Wow @lenz! Could you please elaborate how `str(0xe6)` becomes a 3 ASCII characters? Looks like a single byte to me... — Oleg Zhylin, Aug 19 '19 at 21:37
@OlegZhylin unlike [chr(...)](https://docs.python.org/3/library/functions.html#chr) (which does interpret its argument as numerical codepoint value), [str(...)](https://docs.python.org/3/library/stdtypes.html#str) just converts its argument to a string by calling `object.__str__()` or `repr(object)` - for numeric arguments, this just results in a string containing the decimal representation in ASCII. Therefore [`str(230)=='230'`](https://onlinegdb.com/rJIAPidEH). — T S, Aug 19 '19 at 22:51
@OlegZhylin If this question should actually be about "why `print(string_from_3_ascii_chars.startswith(str(0xe6)))` prints `True`" then I think you should remove the error message and stuff about Pandas and Excel from the question, because in its current state the question doesn't make much sense... — T S, Aug 20 '19 at 12:43
Possible duplicate of https://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it; see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors — tripleee, Sep 15 '19 at 08:40

Oleg Zhylin · Accepted Answer · 2019-08-20T13:12:02.437

0

In the example in the question, str(0xe6) takes the integer 0xe6 (230 in decimal notation) and converts it to its decimal string representation (for an integer, str() gives the same result as repr()). This produces the string '230'. string_from_3_ascii_chars does start with '230', and startswith confirms this by returning True.
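
A minimal sketch of the distinction discussed in the comments, using only built-ins. The variable names other than string_from_3_ascii_chars are illustrative and not from the question:

```python
# str() on an int yields its decimal digits; chr() yields the single
# character at that code point; bytes([...]) yields the raw byte.
as_digits = str(0xe6)      # three characters: '2', '3', '0'
as_char = chr(0xe6)        # one character, U+00E6
as_byte = bytes([0xe6])    # one byte, b'\xe6'

string_from_3_ascii_chars = chr(50) + chr(51) + chr(48)
starts_with_digits = string_from_3_ascii_chars.startswith(as_digits)
starts_with_char = string_from_3_ascii_chars.startswith(as_char)

print(as_digits, len(as_digits))
print(ascii(as_char), len(as_char))
print(starts_with_digits, starts_with_char)

# The reported error arises when bytes are decoded, not from str operations.
# This is the reproduction mentioned in the comments.
try:
    b"\xe6a".decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = str(exc)
    print(decode_error)
```

So the only way for startswith to return False here is to compare against chr(0xe6) rather than str(0xe6); neither comparison involves bytes or decoding.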

edited Aug 20 '19 at 13:12

answered Aug 19 '19 at 23:06


"startswith function always takes a string as an argument" is correct, but "When passed an integer, it converts it" is not true. `startswith` does not automatically convert its argument to a string, it produces an [error](https://onlinegdb.com/SyHeW3dNB) for wrong argument types. You manually converted `0xe6` to a string, by calling `str(0xe6)` – T S Aug 19 '19 at 23:34
@TS ok, I'll think about how to rephrase the question. The challenge is to preserve context for all the comments. – Oleg Zhylin Aug 20 '19 at 20:19

score 0 · Answer 2 · edited Sep 15 '19 at 08:37

This is possible by detaching the text wrapper from the file's underlying binary buffer and rewrapping that buffer with the encoding you want (Latin-1 or ASCII).

1. Create a sample file encoded in Latin-1 (or ASCII).
2. Open the file with "utf-8" encoding.
3. Detach the buffer and rewrap it with "latin-1" (or "ascii").
4. Read the file.

Note: The file is opened in read mode ("r"), so the rewrapped stream can be read from but not written to.

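import io

# Write a sample file encoded as Latin-1.
with open("/Desktop/temp/junk1", "wb") as f:
    s = "Hello Jalapeño".encode("latin-1")
    f.write(s)

# Open as UTF-8 text, then detach the binary buffer and rewrap it as Latin-1.
f = open("/Desktop/temp/junk1", "r", encoding="utf-8")
b = f.detach()  # f is unusable after this; closing it would raise ValueError
with io.TextIOWrapper(b, encoding="latin-1") as g:
    print(g.read())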
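
If the aim is just to read Latin-1 bytes as text, passing the encoding straight to open() may be simpler than detaching. A sketch under that assumption; the temporary file and the names utf8_failed and text are illustrative, not from the answer:

```python
import os
import tempfile

# Write Latin-1 bytes to a temporary file (a stand-in for the sample file).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write("Hello Jalapeño".encode("latin-1"))

# Decoding as UTF-8 fails: 0xF1 (Latin-1 'ñ') announces a multi-byte
# sequence, but the byte that follows is not a continuation byte.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every possible byte to a code point, so this read succeeds.
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

os.remove(path)
print(utf8_failed, ascii(text))
```

Note that Latin-1 never raises on decode, so it will also silently accept bytes that were really written in some other encoding; encoding="ascii" is stricter and raises on any byte above 0x7F.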
