
I'm reading some strings from a text file. Some of them contain "strange" escape sequences, e.g. "\xc3\xa9comiam". If I copy such a string and paste it into a variable, I can convert it to readable characters:

string = "\xc3\xa9comiam"
print(string.encode("raw_unicode_escape").decode('utf-8'))
écomiam

but if I read it from the file, it doesn't work:

with open(fn) as f:
    for string in f.readlines():
        print(string.encode("raw_unicode_escape").decode('utf-8'))
\xc3\xa9comiam

It seems the solution must be pretty easy, but I can't find it. What can I do?

Thanks!

2 Answers


Those are not Unicode escapes. raw_unicode_escape, as the name suggests, handles Unicode sequences like \u00e9 but not byte escapes like \xe9.

What you have is a UTF-8 encoded sequence. The way to decode it is to get it into a bytes sequence, which can then be decoded to a Unicode string.

# Let's not shadow the string library
s = "\xc3\xa9comiam"
print(bytes(s, 'latin-1').decode('utf-8'))

The 'latin-1' trick is a dirty secret which simply converts every byte to a character with the same character code.

For your file, you could open it in binary mode so you don't have to explicitly convert it to bytes, or you could simply apply the same conversion to the strings you read.
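A minimal sketch of both approaches, assuming the file's bytes really are UTF-8 ("demo.txt" is a made-up name; the sketch writes the file first so it is self-contained):

```python
# Write some UTF-8 bytes to a demo file so the sketch is runnable.
with open("demo.txt", "wb") as f:
    f.write("écomiam\n".encode("utf-8"))

# Option 1: binary mode -- you receive bytes and decode them yourself.
with open("demo.txt", "rb") as f:
    lines_binary = [raw.decode("utf-8").rstrip("\n") for raw in f]

# Option 2: text mode with the latin-1 round-trip described above.
with open("demo.txt", encoding="latin-1") as f:
    lines_text = [bytes(line, "latin-1").decode("utf-8").rstrip("\n") for line in f]

print(lines_binary, lines_text)
```

Both print `['écomiam']`; in practice, passing `encoding="utf-8"` to `open()` in the first place is the cleanest fix when you control the call.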

tripleee

Thanks everyone for your help,

I think I've found a solution (not very elegant, but it does the trick):

print(bytes(tm.strip(), "utf-8").decode("unicode_escape").encode("raw_unicode_escape").decode('utf-8'))
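Decomposed step by step, the chain looks like this (a sketch assuming `tm` is one line read from the file, i.e. it contains the literal characters `\`, `x`, `c`, `3`, ... rather than raw bytes):

```python
# What one line from the file apparently looks like: literal escape text.
tm = r"\xc3\xa9comiam"

step1 = bytes(tm.strip(), "utf-8")          # b'\\xc3\\xa9comiam' (still literal escapes)
step2 = step1.decode("unicode_escape")      # turns \xc3, \xa9 into the chars U+00C3, U+00A9
step3 = step2.encode("raw_unicode_escape")  # maps those chars back to the bytes 0xc3 0xa9
result = step3.decode("utf-8")              # interprets the bytes as UTF-8

print(result)                               # écomiam
```

The middle two steps exist because `unicode_escape` produces a str, not bytes; `raw_unicode_escape` (which maps each character below U+0100 to the byte with the same value) converts it back so the final UTF-8 decode can run.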

Thanks!