
I'm reading some strings from a text file. Some of them contain "strange" escape sequences, e.g. "\xc3\xa9comiam". If I copy such a string and paste it into a variable, I can convert it to readable characters:

string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam

but if I read it from the file, it doesn't work:

with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam

It seems the solution must be pretty easy, but I can't find it. What can I do?

Thanks!

2 Answers


Those are not Unicode escapes. raw_unicode_escape, as the name suggests, handles Unicode sequences like \u00e9 but not byte escapes like \xe9.

What you have is a UTF-8 encoded sequence. The way to decode it is to get it into a bytes sequence, which can then be decoded to a Unicode string.

# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))

The 'latin-1' trick is a dirty secret which simply converts every byte to a character with the same character code.

For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
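A minimal sketch of both approaches, assuming the file's bytes really are UTF-8 ("demo.txt" is a made-up name; the sketch writes the file first so it is self-contained):

```python
# Write some UTF-8 bytes to a demo file so the sketch is runnable.
with open("demo.txt", "wb") as f:
    f.write("écomiam\n".encode("utf-8"))

# Option 1: binary mode -- you receive bytes and decode them yourself.
with open("demo.txt", "rb") as f:
    lines_binary = [raw.decode("utf-8").rstrip("\n") for raw in f]

# Option 2: text mode with the latin-1 round-trip described above.
with open("demo.txt", encoding="latin-1") as f:
    lines_text = [bytes(line, "latin-1").decode("utf-8").rstrip("\n") for line in f]

print(lines_binary, lines_text)
```

Both print `['écomiam']`; in practice, passing `encoding="utf-8"` to `open()` in the first place is the cleanest fix when you control the call.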

tripleee

Thanks everyone for your help,

I think I've found a solution (not very elegant, but it does the trick):

print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
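Decomposed step by step, the chain looks like this (a sketch assuming `tm` is one line read from the file, i.e. it contains the literal characters `\`, `x`, `c`, `3`, ... rather than raw bytes):

```python
# What one line from the file apparently looks like: literal escape text.
tm = r"\xc3\xa9comiam"

step1 = bytes(tm.strip(), "utf-8")          # b'\\xc3\\xa9comiam' (still literal escapes)
step2 = step1.decode("unicode_escape")      # turns \xc3, \xa9 into the chars U+00C3, U+00A9
step3 = step2.encode("raw_unicode_escape")  # maps those chars back to the bytes 0xc3 0xa9
result = step3.decode("utf-8")              # interprets the bytes as UTF-8

print(result)                               # écomiam
```

The middle two steps exist because `unicode_escape` produces a str, not bytes; `raw_unicode_escape` (which maps each character below U+0100 to the byte with the same value) converts it back so the final UTF-8 decode can run.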

Thanks!