135

Suppose I have a string which is a backslash-escaped version of another string. Is there an easy way, in Python, to unescape the string? I could, for example, do:

>>> escaped_str = '"Hello,\\nworld!"'
>>> raw_str = eval(escaped_str)
>>> print raw_str
Hello,
world!
>>> 

However, that involves passing a (possibly untrusted) string to eval(), which is a security risk. Is there a function in the standard library that takes an escaped string and returns the unescaped string, with no security implications?

martineau
Nick
  • 1
    If you have a specific single character (like `'\n'`) you need to un-escape, like I had, you can just do `s.replace('\\n', '\n')`. Not posting an answer because the question is more general but I had a similar problem and didn't want to complicate myself with bytes and encodings so just putting this here for others... – Tomerikoo Jul 21 '21 at 08:18

6 Answers

159
>>> print '"Hello,\\nworld!"'.decode('string_escape')
"Hello,
world!"
ChristopheD
  • 24
    Is there something that is compatible with python 3? – thejinx0r Apr 04 '15 at 01:37
  • 6
    @thejinx0r: have a look over here: http://stackoverflow.com/questions/14820429/how-do-i-decodestring-escape-in-python3 – ChristopheD Apr 07 '15 at 08:34
  • 43
    Basically for Python3 you want `print(b"Hello,\nworld!".decode('unicode_escape'))` – ChristopheD Apr 07 '15 at 08:35
  • @ChristopheD Try this `d = re.escape('\w[0-9]')` and then `d.decode('string_escape')`. You dont get the original string. – Amit Tripathi Jul 18 '16 at 09:59
  • 1
    in python 3 `NameError: name 'unicode_escape' is not defined` – ctrl-alt-delor Feb 27 '18 at 22:25
  • @ctrl-alt-delor: You tried to use it as a variable name. It's a string, you need quotes around it. – ShadowRanger Aug 18 '18 at 02:53
  • 10
    For python 3, use `value.encode('utf-8').decode('unicode_escape')` – Casey Kuball Aug 18 '18 at 14:36
  • Note: Unless you're on python 3.7 or newer, this is not an inversion operation to `re.escape`, which prior to 3.7 escapes characters that don't need to be escaped (such as colon `:`). See https://stackoverflow.com/q/51903640/936083. – Casey Kuball Aug 18 '18 at 14:39
  • 19
    **WARNING:** `value.encode('utf-8').decode('unicode_escape')` [corrupts non-ASCII characters in the string](https://bugs.python.org/issue21331). Unless the input is guaranteed to only contain ASCII characters, this is not a valid solution. – Alex Peters Jun 09 '19 at 11:46
  • 1
    So bad that there is no basic string method, because the string method finally makes the escape.. Waste of resources to encode/decode it for a simple unescape.. – gies0r Aug 18 '19 at 23:05
  • 1
    FWIW `value.encode('latin1', errors='backslashreplace').decode('unicode_escape')` seems to work... but is perhaps kind of slow? (I can't think of a downside, but my code smell still says it's too janky). – Mark Harviston Jun 16 '21 at 07:06
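The non-ASCII corruption warned about in the comments above can be reproduced in a few lines. This is a minimal sketch for Python 3; the sample string and the variable names are made up for illustration.

```python
# Sample text: a non-ASCII character plus a literal backslash followed by "n".
value = "caf\u00e9\\n"

# UTF-8 turns the e-acute into two bytes, which unicode_escape then reads as
# two separate Latin-1 characters.
broken = value.encode("utf-8").decode("unicode_escape")

# Latin-1 keeps code points below 256 as single bytes, and backslashreplace
# escapes everything else as \uXXXX so the decoder can restore it.
fixed = value.encode("latin-1", "backslashreplace").decode("unicode_escape")

print(repr(broken))  # 'cafÃ©\n'
print(repr(fixed))   # 'café\n'
```

Both variants unescape the `\n`, but only the second one leaves the accented character intact.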
50

You can use ast.literal_eval, which is safe. From its docstring:

Safely evaluate an expression node or a string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.

Like this:

>>> import ast
>>> escaped_str = '"Hello,\\nworld!"'
>>> print ast.literal_eval(escaped_str)
Hello,
world!
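Since ast.literal_eval accepts any literal and raises on malformed input, it may help to wrap it when the input is untrusted. The following is only a sketch for Python 3; the helper name `unescape_quoted` is hypothetical and not part of the standard library.

```python
import ast

def unescape_quoted(text):
    """Return the str that `text` spells as a quoted Python literal.

    Raises ValueError if `text` is not a single string literal.
    """
    try:
        value = ast.literal_eval(text)
    except (SyntaxError, ValueError) as exc:
        raise ValueError("not a valid Python literal: %r" % (text,)) from exc
    # literal_eval also accepts numbers, lists, dicts, etc.; reject those.
    if not isinstance(value, str):
        raise ValueError("literal is not a string: %r" % (text,))
    return value

print(unescape_quoted('"Hello,\\nworld!"'))
```

Input without surrounding quotes, such as `Hello,\nworld!` with a literal backslash, ends up as a ValueError here instead of a bare SyntaxError.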
jathanism
  • 3
    Having an escaped semi-colon in the string breaks this code. Throws a syntax error "unexpected character after line continuation character" – darksky Jul 01 '16 at 23:00
  • 3
    @darksky notice that `ast` library requires quotes (either `"` or `'`, even `"""` or `'''`) around your escaped_str, since it is actually trying to run it as Python code but enhances security (prevents string injection) – InQβ Dec 04 '17 at 14:01
  • @no1xsyzy: Which in the OP's case is already the case; this is the correct answer when the `str` is a `repr` of a `str` or `bytes` object as in the OP's case; the `unicode-escape` codec answer is for when it's not a `repr`, but some other form of escaped text (not surrounded by quotes as part of the string data itself). – ShadowRanger Aug 18 '18 at 02:55
  • With UTF-8 chars this will not work. Check out the last answer, which uses the codecs package. It actually works. – rubmz Sep 12 '19 at 18:31
  • FWIW I was attempting to parse some escaped JSON text and kept getting this error `[ERROR] TypeError: string indices must be integers` and this solution worked to solve that. Unescape the string, then parse as JSON. – cyber-monk Aug 19 '20 at 17:43
  • This throws a SyntaxError if the string contains a forward slash – Elliott B Feb 03 '21 at 00:50
47

All given answers will break on general Unicode strings. The following works for Python3 in all cases, as far as I can tell:

from codecs import encode, decode
sample = u'mon€y\\nröcks'
result = decode(encode(sample, 'latin-1', 'backslashreplace'), 'unicode-escape')
print(result)

In recent Python versions, this also works without the import:

sample = u'mon€y\\nröcks'
result = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')

As suggested by obataku, you can also use the literal_eval function from the ast module like so:

import ast
sample = u'mon€y\\nröcks'
print(ast.literal_eval(f'"{sample}"'))

Or like this when your string really contains a string literal (including the quotes):

import ast
sample = u'"mon€y\\nröcks"'
print(ast.literal_eval(sample))

However, if you are uncertain whether the input string uses double or single quotes as delimiters, or when you cannot assume it to be properly escaped at all, then literal_eval may raise a SyntaxError while the encode/decode method will still work.
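To illustrate that last caveat, here is a small sketch (Python 3) with an input that contains unescaped double quotes. The sample string is invented for the demonstration.

```python
import ast

# Contains unescaped double quotes and a literal backslash followed by "n".
sample = 'say "hi"\\n'

# Wrapping in double quotes produces an invalid literal, so this fails.
try:
    ast.literal_eval(f'"{sample}"')
    literal_eval_ok = True
except SyntaxError:
    literal_eval_ok = False

# The codec round trip does not care about quoting at all.
decoded = sample.encode('latin-1', 'backslashreplace').decode('unicode-escape')

print(literal_eval_ok)  # False
print(repr(decoded))    # 'say "hi"\n'
```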

starball
Jesko Hüttenhain
  • I don't believe this handles all escaped UTF-8 strings correctly. e.g. starting with `s = '\\xe7\\xa7\\x98'`, python2 `print s.decode('string-escape')` prints `秘` as I'd hope, but your answer in python3 prints `ç§`. [This answer](https://stackoverflow.com/a/58829514/1929012) to another related question in python3 seems to do what I expect: `print(s.encode('latin-1').decode('unicode_escape').encode('latin-1').decode('utf-8'))`. – James Jun 09 '21 at 18:40
  • Hey @James, there can be no universal solution to your problem that would also apply the "correct" encoding, because there is no way to know what that is. In your example, you are expecting UTF-8, but if you were expecting CP1252, for example, your code would clearly fail. However - If you apply my code to the string `s='\\u79d8'`, you will get the character you were looking for! The difference is that your input is the escaped version of _its utf8-encoding_, but the input `s='\\u79d8'` is the escaped version of the _string_. – Jesko Hüttenhain Jun 13 '21 at 01:13
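The distinction drawn in the two comments above can be sketched as follows (Python 3). The helper name `unescape` is hypothetical, and the extra UTF-8 step is only correct if the escaped bytes are known to be UTF-8.

```python
def unescape(s):
    """Unescape backslash sequences, treating each \\xNN as one code point."""
    return s.encode('latin-1', 'backslashreplace').decode('unicode_escape')

# Escaped version of the *string*: one \u escape per character.
escaped_text = r'\u79d8'
from_text = unescape(escaped_text)

# Escaped version of the string's *UTF-8 encoding*: one \x escape per byte.
escaped_bytes = r'\xe7\xa7\x98'
# Unescaping yields three Latin-1 characters; re-encoding them as Latin-1
# recovers the raw bytes, which are then decoded with the assumed encoding.
from_bytes = unescape(escaped_bytes).encode('latin-1').decode('utf-8')

print(from_text == from_bytes)  # True
```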
20

In Python 3, str objects don't have a decode method, so you have to go through a bytes object. ChristopheD's answer covers Python 2.

# create a `bytes` object from a `str`
my_str = "Hello,\\nworld"
# (pick an encoding suitable for your str; 'utf-8' is only safe here for
# ASCII-only input, since 'unicode_escape' decodes the bytes as Latin-1)
my_bytes = my_str.encode("utf-8")

# or directly
my_bytes = b"Hello,\\nworld"

print(my_bytes.decode("unicode_escape"))
# Hello,
# world
asachet
18

For Python3, consider:

my_string.encode('raw_unicode_escape').decode('unicode_escape')

The 'raw_unicode_escape' codec encodes to latin1, but first replaces all other Unicode code points with an escaped '\uXXXX' or '\UXXXXXXXX' form. Importantly, it differs from the normal 'unicode_escape' codec in that it does not touch existing backslashes.

So when the normal 'unicode_escape' decoder is applied, both the newly-escaped code points and the originally-escaped elements are treated equally, and the result is an unescaped native Unicode string.

(The 'raw_unicode_escape' decoder appears to pay attention only to the '\uXXXX' and '\UXXXXXXXX' forms, ignoring all other escapes.)

Documentation: https://docs.python.org/3/library/codecs.html?highlight=codecs#text-encodings
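One way to see how this relates to the `encode('latin-1', 'backslashreplace')` approach from another answer is to run both on the same input. This is only a sketch on one made-up sample (Python 3); it does not prove the two encoders agree for every possible string.

```python
# Latin-1 characters, a BMP character, an astral character, and an escape.
sample = "mon\u20acy\\nr\u00f6cks \U0001F600"

via_raw = sample.encode("raw_unicode_escape")
via_replace = sample.encode("latin-1", "backslashreplace")

# Compare the intermediate bytes produced by the two encoders.
print(via_raw == via_replace)

# Either way, the unicode_escape decoder restores the text and unescapes \n.
print(repr(via_raw.decode("unicode_escape")))
print(repr(via_replace.decode("unicode_escape")))
```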

Jander
  • So am I right in assuming that `s.encode('latin-1', 'backslashreplace')` is the same as `s.encode('raw_unicode_escape')`, or is there some subtle differences that make using `'raw_unicode_escape'` better for this particular application? – Donovan Baarda Mar 21 '22 at 11:53
0

A custom string parser that decodes only some backslash escapes, in this case \\, \" and \', and leaves all others untouched:

def backslash_decode(src):
    """Decode only escaped backslashes and quotes; keep other escapes as-is."""
    pending = False  # True when the previous char was an unpaired backslash
    dst = []
    for char in src:
        if pending:
            if char in ("\\", '"', "'"):
                dst.append(char)  # decode backslash, double quote, single quote
            else:
                dst.append("\\" + char)  # keep backslash-escapes like \n or \t
            pending = False
        elif char == "\\":
            pending = True
        else:
            dst.append(char)  # normal char
    if pending:
        dst.append("\\")  # keep a lone trailing backslash instead of dropping it
    return "".join(dst)

src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z" # input
exp = "a" + "\\"   +  "'"  +  '"'  + r"\n" + r"\t" + r"\x" + "z" # expected output

res = backslash_decode(src)

print(res)
assert res == exp
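For comparison, the same selective decoding can be sketched with a regular expression (Python 3). This swaps the hand-written loop for `re.sub`; the name `backslash_decode_re` is hypothetical, and the check below only covers the sample input from this answer.

```python
import re

# Matches a backslash followed by a backslash, a double quote, or a single quote.
_ESCAPED = re.compile(r"""\\(["'\\])""")

def backslash_decode_re(src):
    """Drop the backslash in front of backslashes and quotes; keep other escapes."""
    return _ESCAPED.sub(r"\1", src)

src = "a" + "\\\\" + r"\'" + r'\"' + r"\n" + r"\t" + r"\x" + "z"  # input
exp = "a" + "\\" + "'" + '"' + r"\n" + r"\t" + r"\x" + "z"        # expected output

assert backslash_decode_re(src) == exp
```

Because `re.sub` scans left to right without overlapping matches, a doubled backslash is consumed as a pair before the following character is considered.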
milahu