Python3 : unescaping non ascii characters

Question

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I have seen methods here and here that don't work. I'm working in a 100% UTF-8 environment.

import codecs

# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print(cod(mystring))

# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)

# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"

Is this a bug? Have I misunderstood something?

Any help would be appreciated!

PS: I edited my post thanks to Michael Foukarakis' remark.

`"€\\n"` is not a Unicode escaped string, so you can not decode it to anything meaningful. `"€\n"`, if it were Unicode escaped, would become `b'\\u20ac\\n'`. So yeah, you seem to have misunderstood encodings. — Michael Foukarakis, Aug 28 '13 at 14:03
A good point : I edited my post. But my problem is the same with the (non unicode) € character. — suizokukan, Aug 28 '13 at 14:08
badcOre > the output is stored in a file and is printed in a terminal (urxvt). — suizokukan, Aug 28 '13 at 14:51

YiguoDada · Answer 1 · 2015-12-23T16:05:04.063

I guess the actual string you need to process is mystring = "€\\n"?

mystring = "€\n"  # that's 2 characters: "€" and a newline
mystring = "€\\n" # that's 3 characters: "€", "\" and "n"

I don't fully understand what goes wrong inside encode() and decode() in Python 3, but my friend solved this problem while we were writing some tools.

What we did was bypass encode("utf_8") once the escape processing is done.

>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n'  # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n'  # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'

We can see that although the result of decode("unicode_escape") looks weird, each of its characters corresponds to one byte of the correct UTF-8 encoding of your string, in this case b'\xe2\x82\xac\n'.

So we neither print the str object directly nor call encode("utf_8") on it; instead we use ord() to build the bytes object b'\xe2\x82\xac\n'.

You can then get the correct str from this bytes object by passing it to str() together with the encoding.
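The ord() loop above is equivalent to encoding with Latin-1, which maps each code point below 256 to the byte of the same value. Here is a minimal sketch of that shortcut. The helper name `unescape_utf8` is made up for illustration. It assumes that every character produced by the escape pass fits in one byte (a \u escape above U+00FF would raise UnicodeEncodeError) and that the resulting bytes form valid UTF-8.

```python
def unescape_utf8(escaped: str) -> str:
    """Process backslash escapes in `escaped` while keeping non-ASCII text intact.

    Hypothetical helper, not part of the standard library.
    """
    raw = escaped.encode("utf-8")           # '€\\n' -> b'\xe2\x82\xac\\n'
    decoded = raw.decode("unicode_escape")  # bytes read as Latin-1, escapes processed
    as_bytes = decoded.encode("latin-1")    # same as bytes([ord(c) for c in decoded])
    return as_bytes.decode("utf-8")         # reassemble the UTF-8 sequences


print(repr(unescape_utf8("€\\n")))
```

Using encode("latin-1") instead of the list comprehension keeps the whole conversion in one chain of codec calls, which may be easier to read.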

BTW, the tool my friend and I wanted to make is a wrapper that lets the user input a C-like string literal and converts the escape sequences automatically.

User input: \n\x61\x62\n\x20\x21  # 20 characters, which represent 6 characters semantically
output:  # \n
ab       # \x61\x62\n
 !       # \x20\x21

That's a powerful tool for entering non-printable characters in a terminal.

Our final tool is:

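#!/usr/bin/env python3
import sys

for line in sys.stdin:
    # Drop the trailing newline, if any, then process the escape sequences.
    text = line.rstrip('\n').encode().decode('unicode_escape')
    # Each resulting character stands for one byte, so ord() rebuilds the raw bytes.
    sys.stdout.buffer.write(bytes([ord(char) for char in text]))
    sys.stdout.flush()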

score 1 · Answer 2 · answered Aug 28 '13 at 14:31

You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.

Firstly, let's look at the documentation for unicode_escape, which states:

Produce[s] a string that is suitable as Unicode literal in Python source code.

Here is what you would get from the network or a file that claims its contents are Unicode escaped:

b'\\u20ac\\n'

Now, you have to decode this to use it in your app:

>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'

and if you wanted to write it back to, say, a Python source file:

with open('/tmp/foo', 'wb') as fh: # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
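As a quick check of the round trip described above, this sketch encodes a string with unicode_escape and decodes it back. It also shows what happens when UTF-8 bytes are fed to the decoder: unicode_escape reads each byte as a Latin-1 character, which appears to be where the 'â\x82¬' in the question comes from. The variable names are illustrative only.

```python
original = "€\n"

# Encoding yields pure ASCII bytes with the escapes spelled out.
escaped = original.encode("unicode_escape")
print(escaped)

# Decoding those bytes restores the original string.
restored = escaped.decode("unicode_escape")
print(restored == original)

# UTF-8 bytes are not escape sequences: each byte is read as a Latin-1 character.
mangled = original.encode("utf-8").decode("unicode_escape")
print(mangled == original.encode("utf-8").decode("latin-1"))
```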

answered Aug 28 '13 at 14:31

Michael Foukarakis


Thank you for your answer. My "encoded" string ("\€\n" for example) has a very Pythonic origin: it's the value returned by a call to re.escape(). As far as I know there's no inverse function such as re.unescape(). Hence my attempt to decode the "escaped" string. How can I achieve that? – suizokukan Aug 28 '13 at 14:50
The answer to the question "which is the suitable encoding?" depends on how it is going to be used. So, what is your use case? Also, are you sure `re.escape` is necessary, i.e. are you using user input as a regex? – Michael Foukarakis Aug 28 '13 at 14:52
These strings are read from a UTF-8 encoded file and will be written as UTF-8 strings in another file. Luckily, I don't mix different encodings. – suizokukan Aug 28 '13 at 14:54
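For the re.escape() case raised in these comments, one possible approach is to skip the codecs entirely and simply drop the backslash that re.escape() puts in front of each escaped character. The sketch below uses a made-up helper name, `re_unescape`. It assumes the input really came from re.escape() and contains no NUL character, which older Python versions escape as \000.

```python
import re


def re_unescape(escaped: str) -> str:
    """Undo re.escape() by dropping the backslash in front of each escaped character.

    Hypothetical helper: the re module does not provide an inverse of re.escape().
    """
    # re.DOTALL lets '.' match a newline, which re.escape() may also escape.
    return re.sub(r"\\(.)", r"\1", escaped, flags=re.DOTALL)


sample = "€\n1+1=2 (a.b)"
print(re_unescape(re.escape(sample)) == sample)
```

Because this works on str objects only, no bytes are ever reinterpreted, so non-ASCII characters such as € pass through untouched.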

score 0 · Answer 3 · answered Aug 28 '13 at 16:30

import string
printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c)>=32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))

Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
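An alternative to maintaining a list of printable characters by hand is the built-in str.isprintable() method, which also covers non-ASCII characters. This is a sketch of the same idea as the function above, under a hypothetical name, and not a drop-in replacement tested against the asker's data.

```python
def escape_nonprintable(s: str) -> str:
    """Escape only the characters that str.isprintable() reports as non-printable.

    Hypothetical helper name; it mirrors the function above without a hand-made list.
    """
    return "".join(
        c if c.isprintable() else c.encode("unicode_escape").decode("ascii")
        for c in s
    )


print(escape_nonprintable("€\n"))
```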
