3

I'm loading a file with a bunch of Unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-Unicode form (\u91cb) in Python. I've found a couple of similar questions here on Stack Overflow, including "Evaluate UTF-8 literal escape sequences in a string in Python3", which does almost exactly what I want, but I can't work out how to save the data.

For example: Input file:

\xe9\x87\x8b

Python Script

file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()

Output File (That I would like):

\u91cb

Community
  • What does `print(open('input.txt', 'rb').read())` show? Is it `b'\xe9\x87\x8b'` or `b'\\xe9\\x87\\x8b'`? – jfs Sep 17 '15 at 00:57

3 Answers

5

You need to encode it again with unicode-escape encoding.

>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'

Modified code (binary mode avoids the unnecessary encode/decode steps):

with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip removes trailing whitespace such as the final newline
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
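
The same chain can be wrapped in a small function so that each step is visible. This is only a sketch: the function name is made up for illustration, and it assumes the input holds literal backslash escapes, as in the question.

```python
def utf8_escapes_to_unicode_escapes(raw: bytes) -> bytes:
    """Convert literal UTF-8 byte escapes into literal Unicode escapes."""
    # Step 1: interpret the backslash escapes; each \xNN becomes one
    # code point in the range 0-255.
    code_points = raw.decode('unicode-escape')
    # Step 2: latin1 maps code points 0-255 straight back to bytes,
    # which recovers the real UTF-8 byte sequence.
    utf8_bytes = code_points.encode('latin1')
    # Step 3: decode the UTF-8 bytes into the actual text.
    text = utf8_bytes.decode('utf-8')
    # Step 4: write non-ASCII characters back out as escapes.
    return text.encode('unicode-escape')


result = utf8_escapes_to_unicode_escapes(br'\xe9\x87\x8b')
print(result)  # b'\\u91cb'
```

The printed value is the repr of six ASCII bytes: a backslash, `u`, and the four hex digits.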

http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq

falsetru
3

\xe9\x87\x8b is not a Unicode character. It looks like the representation of a bytestring that holds a Unicode character encoded as UTF-8. \u91cb is the representation of that character in Python source code (or in JSON). Don't confuse the text representation with the character itself:

>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr() in Python 2; Python 3 shows '釋'
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
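
To make the distinction concrete in Python 3, here is a short sketch comparing the bytes, the character, and its escaped representation. The variable names are only illustrative.

```python
raw = b"\xe9\x87\x8b"        # three bytes: the UTF-8 encoding
char = raw.decode("utf-8")   # one character: U+91CB

print(len(raw), len(char))   # 3 1
print(ascii(char))           # '\u91cb' (the escaped representation)
print(char == "\u91cb")      # True: the escape is just another spelling
```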

To read text encoded as utf-8 from a file, specify the character encoding explicitly:

with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()

It is exactly the same for saving Unicode text to a file:

with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)

If you omit the explicit encoding parameter, locale.getpreferredencoding(False) is used, which may produce mojibake if it does not match the character encoding the file was actually saved with.
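
A rough illustration of how such a mismatch shows up, decoding UTF-8 bytes with latin-1 as a stand-in for a wrong locale default (which codec your system actually picks is an assumption here):

```python
import locale

text = "\u91cb"
data = text.encode("utf-8")        # what is actually stored in the file

# What open() would fall back to without an explicit encoding= argument.
print(locale.getpreferredencoding(False))

wrong = data.decode("latin-1")     # wrong codec: three junk characters
right = data.decode("utf-8")       # matching codec: the original text

print(len(wrong), len(right))      # 3 1
```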

If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.
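
One way to check which case you are in is to look at the raw bytes. The helper below is hypothetical and deliberately crude: it only looks for a backslash followed by `x`.

```python
def contains_literal_escapes(data: bytes) -> bool:
    """Return True if the data holds the two characters backslash + 'x'."""
    return b"\\x" in data


real_utf8 = b"\xe9\x87\x8b"      # 3 bytes: properly encoded text
literal = br"\xe9\x87\x8b"       # 12 bytes: backslashes written out

print(len(real_utf8), contains_literal_escapes(real_utf8))  # 3 False
print(len(literal), contains_literal_escapes(literal))      # 12 True
```

In real use you would pass it the result of `open('input.txt', 'rb').read()`.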

jfs
1

It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, as per your reference):

with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()

text will contain the content of the file as a str (i.e. a Unicode string). Now you can write it in Unicode-escaped form directly to a file by specifying encoding='unicode-escape':

with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)

Your file will now contain Unicode-escaped literals:

$ cat output.txt
\u91cb
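
One thing to watch for: the unicode-escape codec also escapes control characters such as newlines. If you want line breaks kept as they are, a possible alternative (my suggestion, not something from the question) is the ascii codec with the backslashreplace error handler:

```python
text = "\u91cb\n"

# unicode-escape turns the newline into the two characters \ and n.
escaped_all = text.encode("unicode-escape")

# ascii + backslashreplace escapes only what ASCII cannot represent,
# so the newline survives as a real line break.
escaped_non_ascii = text.encode("ascii", "backslashreplace")

print(escaped_all)        # b'\\u91cb\\n'
print(escaped_non_ascii)  # b'\\u91cb\n'
```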
mhawke