Process escape sequences in a string in Python

Question

Sometimes when I get input from a file or the user, I get a string with escape sequences in it. I would like to process the escape sequences in the same way that Python processes escape sequences in string literals.

For example, let's say myString is defined as:

>>> myString = "spam\\neggs"
>>> print(myString)
spam\neggs

I want a function (I'll call it process) that does this:

>>> print(process(myString))
spam
eggs

It's important that the function can process all of the escape sequences in Python (listed in a table in the link above).

Does Python have a function to do this?

hmmm, how exactly would you expect a string containing `'spam'+"eggs"+'''some'''+"""more"""` to be processed? — Nas Banov, Oct 26 '10 at 05:05
@Nas Banov That's a good test. That string contains no escape sequences, so it should be exactly the same after processing. `myString = "'spam'+\"eggs\"+'''some'''+\"\"\"more\"\"\""`, `print(bytes(myString, "utf-8").decode("unicode_escape"))` seems to work. — dln385, Oct 26 '10 at 06:11
Most answers to this question have serious problems. There seems to be no standard way to honor escape sequences in Python without breaking unicode. The answer posted by @rspeer is the one that I adopted for [Grako](https://pypi.python.org/pypi/grako/) as it so far handles all known cases. — Apalala, Jul 01 '14 at 22:59
I disagree with Apalala; using unicode_escape (on a properly latin1-encoded input) is completely reliable, and as the issue that Hack5 links to in his comment to user19087's answer shows, is the method recommended by the python developers. — Glen Whitney, Dec 17 '20 at 01:41
Does this answer your question? [How to un-escape a backslash-escaped string?](https://stackoverflow.com/questions/1885181/how-to-un-escape-a-backslash-escaped-string) — Glen Whitney, Feb 12 '21 at 04:06
Related: [how do I .decode('string-escape') in Python3?](https://stackoverflow.com/questions/14820429/how-do-i-decodestring-escape-in-python3) — SuperStormer, Feb 23 '22 at 00:36
Related: https://stackoverflow.com/questions/63218987/convert-x-escaped-string-into-readable-string-in-python — Karl Knechtel, Aug 05 '22 at 01:52
Related: https://stackoverflow.com/questions/43662474/reversing-pythons-re-escape — Karl Knechtel, Aug 05 '22 at 02:40
Note that most of these approaches will work with `bytes` input - for the ones that involve converting to `bytes` first, just skip that step. Similarly, `str` output can be converted to `bytes` if needed by simply using an appropriate encoding - `latin-1` is probably what you want. — Karl Knechtel, Aug 06 '22 at 01:00
For the opposite problem - converting from "special" characters into escape sequences - see [Python print string like a raw string](https://stackoverflow.com/questions/26520111). However, note that this is **not a round-trip conversion**; there are multiple ways to represent a given string with escape sequences, and only one of them is particularly easy to get. — Karl Knechtel, Aug 07 '22 at 09:40
Maybe you can try using `eval`? For example, `print(eval('"spam\\neggs"'))` prints your desired output, and of course you may need to add/adjust some quotes before/after your original string. — Shuo Ding, Jun 03 '23 at 19:59

Jerub · Accepted Answer · 2010-10-26T06:29:28.333

177

The correct thing to do is to decode the string with an escape codec: 'string_escape' in Python 2, or 'unicode_escape' applied to the encoded bytes in Python 3.

>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3 
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs

Don't use the AST or eval. Using the string codecs is much safer.
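A minimal sketch of the Python 3 variant, wrapped in a helper. The name `process_ascii` is made up for illustration, and the sketch assumes the input is pure ASCII: encoding with `"ascii"` instead of `"utf-8"` makes non-ASCII input fail loudly rather than come back garbled (the comments below describe that failure).

```python
def process_ascii(text):
    """Hypothetical helper: decode Python-style escapes in ASCII-only text."""
    # str.encode("ascii") raises UnicodeEncodeError on any non-ASCII character,
    # so the caller finds out immediately instead of receiving mojibake.
    return text.encode("ascii").decode("unicode_escape")


print(process_ascii("spam\\neggs"))          # prints spam and eggs on two lines
print(process_ascii("A is \\x41, \\u0041"))  # prints: A is A, A
```

For text that may contain literal non-ASCII characters, the other answers in this thread are the place to look.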

edited Oct 26 '10 at 06:29

answered Oct 26 '10 at 05:01

Jerub


3

hands down, the **best** solution! btw, by docs it should be "string_escape" (with underscore) but for some reason accepts anything in the pattern 'string escape', 'string@escape" and whatnot... basically `'string\W+escape'` – Nas Banov Oct 26 '10 at 05:18
2

@Nas Banov The documentation does [make a small mention about that](http://docs.python.org/library/codecs.html#standard-encodings): `Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.` – dln385 Oct 26 '10 at 05:44
1

In Python 3, the command needs to be `print(bytes(myString, "utf-8").decode("unicode_escape"))` – dln385 Oct 26 '10 at 06:06
@dln385 Does it work with non-ascii characters? I have some non-ascii chars with \\t. In python2, string-escape just works for that. But in python3, the codec is removed. And the unicode-escape just escapes all non-ascii bytes and breaks my encoding. – Ning Sun Feb 17 '12 at 09:59
In Python 2.7, myStr.decode('unicode_escape') seems better than myStr.decode('string_escape'), because it will also unescape unicode \udddd escape sequences into actual unicode characters. For example, r"\u2014".decode('unicode_escape') yields u"\u2014". string_escape, in contrast, leaves unicode escapes untouched. Though note that (at least in my locale) while I can put non-ASCII unicode *escapes* in myStr, I can't put actual non-ASCII *characters* in myStr, or decode will give me "UnicodeEncodeError: 'ascii' codec can't encode character" problems. – Chris May 14 '13 at 08:44
37

This solution is not good enough because it doesn't handle the case in which there are legit unicode characters in the original string. If you try: ``>>> print("juancarlo\\tañez".encode('utf-8').decode('unicode_escape'))`` You get: ``juancarlo aÃ±ez`` – Apalala Jul 01 '14 at 19:04
3

Agreed with @Apalala: this is not good enough. Check out rspeer's answer below for a complete solution that works in Python2 and 3! – Christian Aichinger Mar 28 '16 at 03:26
3

Since `latin1` is assumed by `unicode_escape`, redo the encode/decode bit, e.g. `s.encode('utf-8').decode('unicode_escape').encode('latin1').decode('utf8')` – metatoaster May 25 '18 at 09:01
@metatoaster As stated in my answer, that doesn't work if your string contains any characters that aren't in latin-1. – rspeer Jul 06 '18 at 03:39
@rspeer the whole string when being decoded as `unicode_escape` is `bytes`, which means it doesn't have any encoding, but `unicode_escape` is a valid codec which would produce the same `bytes` as `unicode` encoded in `latin1` from the input string. For ease of illustration please look at this [example](https://gist.github.com/metatoaster/c94ea8a33284f80e1b83f66c16c9b6d0) and see how that actually works through every single step (to ease the effort from having to manually try it on your end). Hence I said "redo the encode/decode bit". – metatoaster Jul 06 '18 at 05:19
@metatoaster Oh, I see! Yes, that actually does work. Nice. – rspeer Jul 10 '18 at 02:41
Just wanted to note that metatoaster is correct, unicode_escape does need a latin-1 coded byte sequence, but it's not necessary to make two roundtrips between strings and byte sequences (see alternate answer for python3). – Glen Whitney Dec 17 '20 at 01:38
@metatoaster But isn't your solution still a bit fragile, since `s.encode('utf-8')` encodes the output in utf-8 and `decode('unicode_escape')` assumes the input is latin-1? Is it possible that the utf-8 encoding introduces some backslash bytes? It would probably work fine most of the time, but if the input string included a unicode character that when utf-8 encoded included a `0x5c` latin-1 backslash character, that backslash would get escaped, which would then probably break the final `decode('utf-8')`. – Donovan Baarda Mar 21 '22 at 11:30
2

@DonovanBaarda no, there are no multi-byte `utf-8` representations of any unicode codepoints > 127 that produce `bytes` within the `ascii` range (0-127), as all bytes of multi-byte characters are in the range 128-255 (i.e. `0x80` - `0xff`) because the designers of unicode and utf-8 understood this exact issue. In other words, no, it is impossible for `str.encode('utf-8')` to produce the `bytes` `b'\x5c'` (`0x5c`) from anything other than the unicode codepoint `U+005C`. – metatoaster Mar 22 '22 at 03:24
Tried using `codecs.decode(myString, 'unicode-escape')`, since `codecs.decode` accepts Unicode input directly. Turns out that *still* fails on input outside the ASCII range, in the exact same way Apalala pointed out the current version of the answer already fails. – user2357112 Aug 05 '22 at 00:19
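The encode/decode chain that metatoaster describes in the comments above can be sketched as follows. The helper name `unescape_roundtrip` is hypothetical. The last part shows a limitation I would expect from this chain: it assumes the escapes themselves never produce non-ASCII characters.

```python
def unescape_roundtrip(text):
    """Hypothetical helper following the encode/decode chain from the comments."""
    return (text.encode("utf-8")            # str -> UTF-8 bytes
                .decode("unicode_escape")   # escapes processed, other bytes read as Latin-1
                .encode("latin-1")          # undo the Latin-1 reading
                .decode("utf-8"))           # reinterpret the bytes as UTF-8


print(unescape_roundtrip("juancarlo\\tañez"))

# UTF-8 only uses bytes >= 0x80 for non-ASCII characters, so the first
# encoding step cannot introduce a stray backslash byte (0x5c).
assert all(b >= 0x80 for b in "ñőλ語😀".encode("utf-8"))

# An escape that yields a non-ASCII character breaks the final UTF-8 decode.
try:
    unescape_roundtrip("caf\\xe9")
except UnicodeDecodeError as exc:
    print("failed:", exc.reason)
```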

score 161 · Answer 2 · edited Jul 12 '23 at 14:20

`unicode_escape` doesn't work in general

It turns out that the string_escape or unicode_escape solution does not work in general -- particularly, it doesn't work in the presence of actual Unicode.

If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.

unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.

The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?

The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.

>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve   test

Well, that's wrong.

The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?

>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve   test

Not at all. (Also, the above is a UnicodeError on Python 2.)

The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:

>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve    test

But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!

>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151'
in position 3: ordinal not in range(256)
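To make the mechanism concrete, here is a small sketch that inspects the intermediate bytes from the first example. It only uses calls already shown above; the variable names are mine.

```python
s = 'naïve \\t test'

utf8_bytes = s.encode('utf-8')
print(utf8_bytes)            # b'na\xc3\xafve \\t test'

# unicode_escape processes the \t escape, but reads 0xc3 and 0xaf as two
# separate Latin-1 characters instead of one UTF-8 encoded 'ï'.
garbled = utf8_bytes.decode('unicode_escape')
print(repr(garbled))         # 'naÃ¯ve \t test'

# The same two bytes, read directly as Latin-1, give the same two characters.
assert b'\xc3\xaf'.decode('latin-1') == 'Ã¯'
```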

Adding a regular expression to solve the problem

(Surprisingly, we do not now have two problems.)

What we need to do is only apply the unicode_escape decoder to things that we are certain to be ASCII text. In particular, we can make sure only to apply it to valid Python escape sequences, which are guaranteed to be ASCII text.

The plan is, we'll find escape sequences using a regular expression, and use a function as the argument to re.sub to replace them with their unescaped value.

import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)

And with that:

>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő     Rubik
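One possible tightening, offered as a sketch rather than as part of the answer: the pattern above uses `.` for the hex digits, so something like `\xyz` is matched and handed to the codec even though it is not a valid escape. The variant below restricts those positions to hex digits. The names `ESCAPE_RE_STRICT` and `decode_escapes_strict` are hypothetical.

```python
import codecs
import re

# Same alternatives as above, but hex positions only accept hex digits
# (an assumption of this sketch, not something the answer specifies).
ESCAPE_RE_STRICT = re.compile(r'''
    ( \\U[0-9A-Fa-f]{8}   # 8-digit hex escapes
    | \\u[0-9A-Fa-f]{4}   # 4-digit hex escapes
    | \\x[0-9A-Fa-f]{2}   # 2-digit hex escapes
    | \\[0-7]{1,3}        # Octal escapes
    | \\N\{[^}]+\}        # Unicode characters by name
    | \\[\\'"abfnrtv]     # Single-character escapes
    )''', re.VERBOSE)


def decode_escapes_strict(s):
    # Only the matched ASCII escape is passed to the codec; everything else,
    # including literal non-ASCII text, is left untouched.
    return ESCAPE_RE_STRICT.sub(
        lambda m: codecs.decode(m.group(0), 'unicode_escape'), s)


print(decode_escapes_strict('caf\\u00e9 \\101 \\N{LATIN SMALL LETTER A}'))
print(decode_escapes_strict('C:\\xyz'))   # not a valid escape, left as-is
```

Backslashes that happen to form a valid escape (for example in a Windows path such as `C:\new`) are still decoded, so this does not change the caveat about paths raised in the comments.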

edited Jul 12 '23 at 14:20

Hakaishin


answered Jul 01 '14 at 21:12

rspeer


5

we need more encompassing types of answers like that. thanks. – v.oddou Jan 15 '15 at 05:36
Does this work with `os.sep` at all? I'm trying to do this: `patt = '^' + self.prefix + os.sep ; name = sub(decode_escapes(patt), '', name)` and it's not working. Semicolon is there in place of a new line. – AncientSwordRage Feb 20 '15 at 11:18
1

@Pureferret I'm not really sure what you're asking, but you probably shouldn't run this on strings where the backslash has a different meaning, such as Windows file paths. (Is that what your `os.sep` is?) If you have backslashed escape sequences in your Windows directory names, the situation is pretty much unrecoverable. – rspeer Feb 20 '15 at 22:10
The escape sequence doesn't have escapes in them, but I'm getting a 'bogus escape string ' error – AncientSwordRage Feb 20 '15 at 23:28
That tells me that you ended some other regular expression with a backslash: http://stackoverflow.com/questions/4427174/python-re-bogus-escape-error – rspeer Feb 21 '15 at 05:13
This doesn't work for me, as `unicode-escape` doesn't do the right thing: `test = "\\xe2\\x80\\xa6" test_bytes = test.encode() test = test_bytes.decode("unicode-escape")` Values: `test_bytes` == `b'\\xe2\\x80\\xa6'` `test` == `'â¦'` – Mark Ingram Jun 23 '15 at 12:41
@MarkIngram -- this regular expression is a Unicode regular expression about Unicode escapes, where `\xe2` actually means "unicode character E2" instead of "byte E2". It's not about bytes. If you were able to get it to try to match a byte string, you must have changed the code or used Python 2 coercion. – rspeer Jul 04 '15 at 17:35
@rspeer Did you try my example with Python3? That's what I was using, and that short example doesn't work. – Mark Ingram Jul 07 '15 at 10:17
1

@MarkIngram Yes, I'm using Python 3. I don't understand the relevance of the example you posted, which is doing something unrelated to my code. My code doesn't use bytestrings at any step. – rspeer Jul 13 '15 at 16:10
What you have there, by the way, is a bytestring that's the escape-encoding of another bytestring, which is itself the UTF-8 encoding of some unicode. If you need help decoding it, ask it as a separate question. – rspeer Jul 13 '15 at 16:10
just for us lambda's: `ESCAPE_SEQUENCE_RE.sub(lambda match: codecs.decode(match.group(0), 'unicode-escape'), s)` – TheDiveO Jul 04 '18 at 12:24
Doesn't work for me... the print statement is doing the conversion, not the function itself? – James McCorrie Oct 03 '19 at 08:03
If we are throwing regular expressions at the problem, why include the unicode_escape codec in the solution at all? In that case, just (re)implement the escape conventions directly with a regular expression. But then the approach is not "DRY" -- the language and the reimplementing regexp might diverge. Better to rely only on the language-internal unicode_escape codec, properly applied to a latin-1 encoding as documented. – Glen Whitney Dec 17 '20 at 01:35
@GlenWhitney: That fails on input that cannot be latin-1 encoded. latin-1 only handles a tiny fraction of the full Unicode range. – user2357112 Aug 05 '22 at 00:24
I respectfully disagree; see Karl Knechtel's comment to the answer I posted: "non-latin-1 characters are turned into escape sequences via the 'backslashreplace' error handling." Also what Karl says is true, that solution will fail on input that ends in a backslash, for example, but then the input wasn't actually composed of valid Python escape sequences, so I don't think there is an unambiguous answer. If you have a specific case where using the unicode_escape codec as shown below doesn't work, please comment on the answer I posted and I will be happy to look at it. – Glen Whitney Aug 10 '22 at 20:10

score 43 · Answer 3 · edited May 23 '17 at 12:02

The actually correct and convenient answer for python 3:

>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve    test

Details regarding codecs.escape_decode:

codecs.escape_decode is a bytes-to-bytes decoder
codecs.escape_decode decodes ascii escape sequences, such as: b"\\n" -> b"\n", b"\\xce" -> b"\xce".
codecs.escape_decode does not care or need to know about the byte object's encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.

Background:

@rspeer is correct: unicode_escape is the incorrect solution for python3. This is because unicode_escape decodes escaped bytes, then decodes bytes to unicode string, but receives no information regarding which codec to use for the second operation.
@Jerub is correct: avoid the AST or eval.
I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, that function is currently not documented for python 3.
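A sketch of the same approach as a helper, with the return value unpacked explicitly. `unescape_via_bytes` is a hypothetical name, and since `codecs.escape_decode` is undocumented, its behaviour here is what I would expect from current CPython rather than a guarantee.

```python
import codecs
import warnings


def unescape_via_bytes(text):
    """Hypothetical helper around the undocumented codecs.escape_decode."""
    raw = text.encode("utf-8")
    # escape_decode returns a tuple: (decoded bytes, number of bytes consumed).
    unescaped, _consumed = codecs.escape_decode(raw)
    return unescaped.decode("utf-8")


print(unescape_via_bytes("naïve \\t test"))
print(unescape_via_bytes("\\xc3\\xa9"))   # \x escapes of the UTF-8 bytes for é

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # \u is not a bytes escape, so I would expect it to pass through
    # undecoded (possibly with a deprecation warning, silenced here).
    print(unescape_via_bytes("\\u00e9"))
```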

edited May 23 '17 at 12:02

Community


answered May 05 '16 at 20:27

user19087


This is the real answer (: Too bad it relies upon a poorly-documented function. – jwd Feb 21 '17 at 18:42
6

This is the answer for situations where the escape sequences you have are `\x` escapes of UTF-8 bytes. But because it decodes bytes to bytes, it doesn't -- and can't -- decode any escapes of non-ASCII Unicode characters, such as `\u` escapes. – rspeer Aug 16 '17 at 17:10
3

Just an FYI, this function is technically not public. see https://bugs.python.org/issue30588 – Hack5 Oct 26 '19 at 19:03
Moreover, in the link that Hack5 provides, the python maintainers make it clear that escape_decode may be removed without warning in any future version, and that the "unicode_escape" codec is the recommended way to go about this. – Glen Whitney Dec 17 '20 at 01:29

score 11 · Answer 4 · answered Oct 26 '10 at 03:50

The ast.literal_eval function comes close, but it will expect the string to be properly quoted first.

Of course Python's interpretation of backslash escapes depends on how the string is quoted ("" vs r"" vs u"", triple quotes, etc) so you may want to wrap the user input in suitable quotes and pass to literal_eval. Wrapping it in quotes will also prevent literal_eval from returning a number, tuple, dictionary, etc.

Things still might get tricky if the user types unquoted quotes of the type you intend to wrap around the string.
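A minimal sketch of the wrapping idea, assuming double quotes as the wrapper. `naive_unescape` is a hypothetical name. It also shows what happens with the tricky inputs: they are rejected with an exception, and nothing in them is executed.

```python
import ast


def naive_unescape(text):
    """Hypothetical helper: wrap in double quotes and let literal_eval parse it."""
    return ast.literal_eval('"' + text + '"')


print(naive_unescape("spam\\neggs"))

# Inputs containing an unescaped double quote (or a raw newline) no longer
# form a single string literal once wrapped.
for tricky in ['say "hi"', '"\ndoBadStuff()\n"']:
    try:
        naive_unescape(tricky)
    except (SyntaxError, ValueError) as exc:
        # Rejected as malformed; no code from the input runs.
        print(type(exc).__name__, "for", repr(tricky))
```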

answered Oct 26 '10 at 03:50

Greg Hewgill


I see. This seems to be potentially dangerous as you say: `myString = "\"\ndoBadStuff()\n\""`, `print(ast.literal_eval('"' + myString + '"'))` seems to try to run code. How is `ast.literal_eval` any different/safer than `eval`? – dln385 Oct 26 '10 at 04:05
9

@dln385: `literal_eval` never executes code. From the documentation, "This can be used for safely evaluating strings containing Python expressions from untrusted sources without the need to parse the values oneself." – Greg Hewgill Oct 26 '10 at 04:16

score 4 · Answer 5 · answered Dec 17 '20 at 01:26

The (currently) accepted answer by Jerub is correct for python2, but incorrect for python3, where it may produce garbled results (as Apalala points out in a comment on that solution). That's because the unicode_escape codec requires its source to be encoded in latin-1, not utf-8, as per the official python docs. Hence, in python3 use:

>>> myString="špåm\\nëðþ\\x73"
>>> print(myString)
špåm\nëðþ\x73
>>> decoded_string = myString.encode('latin-1','backslashreplace').decode('unicode_escape')
>>> print(decoded_string)
špåm
ëðþs

This method also avoids the extra unnecessary roundtrip between strings and bytes in metatoaster's comments to Jerub's solution (but hats off to metatoaster for recognizing the bug in that solution).
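A short sketch of why this works for characters outside latin-1, using a hypothetical wrapper named `unescape_latin1`: the `'backslashreplace'` handler rewrites those characters as escapes, which the decode step then turns back into the original characters.

```python
def unescape_latin1(text):
    """Hypothetical wrapper for the one-liner above."""
    # Characters outside Latin-1 become \uXXXX or \UXXXXXXXX escapes here...
    as_bytes = text.encode("latin-1", "backslashreplace")
    # ...and unicode_escape decodes them again, along with the original escapes.
    return as_bytes.decode("unicode_escape")


print("ő😀".encode("latin-1", "backslashreplace"))   # b'\\u0151\\U0001f600'
print(unescape_latin1("Ernő \\t Rubik"))

# Input ending in a lone backslash is not a valid escape and is rejected.
try:
    unescape_latin1("trailing\\")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```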

answered Dec 17 '20 at 01:26

Glen Whitney


When I posted this, I did not realize there was a duplicate question for which this exact answer had already been given: https://stackoverflow.com/a/57192592/5583443 – Glen Whitney Feb 12 '21 at 04:00
The important thing here is not just that latin-1 is used, but that non-latin-1 characters are turned into escape sequences via the `'backslashreplace'` error handling. This just happens to give the exact format that the `.decode` step is trying to replace. So this works with, for example, `myString='日本\u8a9e'`, correctly giving `日本語`. However, it doesn't handle the truly nasty cases described in my answer. – Karl Knechtel Aug 06 '22 at 01:07
(On the other hand, it certainly can be argued that input with a single trailing backslash *should* fail...) – Karl Knechtel Aug 06 '22 at 01:10

score 0 · Answer 6 · answered Mar 04 '19 at 22:45

This is a bad way of doing it, but it worked for me when trying to interpret escaped octals passed in a string argument.

import sys

# Unsafe: eval runs whatever Python expression the argument turns into.
input_string = eval('b"' + sys.argv[1] + '"')

It's worth mentioning that there is a difference between eval and ast.literal_eval (eval being way more unsafe). See Using python's eval() vs. ast.literal_eval()?

answered Mar 04 '19 at 22:45

LimeTr33


Just to make sure the warning is up front: **Please do not use `eval` for input that could ever possibly come from outside the program. It allows the user supplying that input to run arbitrary code on your computer. It is not at all trivial to sandbox.** – Karl Knechtel Aug 05 '22 at 01:22

score 0 · Answer 7 · answered Aug 05 '22 at 01:14

Quote the string properly so that it looks like the equivalent Python string literal, and then use ast.literal_eval. This is safe, but much trickier to get right than you might expect.

It's easy enough to add a " to the beginning and end of the string, but we also need to make sure that any " inside the string are properly escaped. If we want fully Python-compliant translation, we need to account for the deprecated behaviour of invalid escape sequences.

It works out that we need to add one backslash to:

any sequence of an even number of backslashes followed by a double-quote (so that we escape a quote if needed, but don't escape a backslash and un-escape the quote if it was already escaped); as well as
a sequence of an odd number of backslashes at the end of the input (because otherwise a backslash would escape our enclosing double-quote).

Here is an acid-test input showing a bunch of difficult cases:

>>> text = r'''\\ \ \" \\" \\\" \'你好'\n\u062a\xff\N{LATIN SMALL LETTER A}"''' + '\\'
>>> text
'\\\\ \\ \\" \\\\" \\\\\\" \\\'你好\'\\n\\u062a\\xff\\N{LATIN SMALL LETTER A}"\\'
>>> print(text)
\\ \ \" \\" \\\" \'你好'\n\u062a\xff\N{LATIN SMALL LETTER A}"\

I was eventually able to work out a regex that handles all these cases properly, allowing literal_eval to be used:

>>> import ast
>>> import re
>>> def parse_escapes(text):
...     fixed_escapes = re.sub(r'(?<!\\)(\\\\)*("|\\$)', r'\\\1\2', text)
...     return ast.literal_eval(f'"{fixed_escapes}"')
...

Testing the results:

>>> parse_escapes(text)
'\\ \\ " \\" \\" \'你好\'\nتÿa"\\'
>>> print(parse_escapes(text))
\ \ " \" \" '你好'
تÿa"\

This should correctly handle everything - strings containing both single and double quotes, every weird situation with backslashes, and non-ASCII characters in the input. (I admit it's a bit difficult to verify the results by eye!)
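Since the acid test is hard to check by eye, here is a sketch that runs the same substitution on a few smaller inputs and prints the intermediate text. It repeats the regex from above so that the block runs on its own; `QUOTE_FIX` and `fix_quotes` are hypothetical names for the split-out step.

```python
import ast
import re

# Same pattern as in parse_escapes above, compiled once.
QUOTE_FIX = re.compile(r'(?<!\\)(\\\\)*("|\\$)')


def fix_quotes(text):
    # Adds one backslash before an unescaped double quote and before a
    # lone trailing backslash.
    return QUOTE_FIX.sub(r'\\\1\2', text)


def parse_escapes(text):
    return ast.literal_eval(f'"{fix_quotes(text)}"')


# An unescaped quote, an already escaped quote, and a trailing backslash.
for sample in ['say "hi"\\n', '\\"', 'abc\\']:
    print(repr(sample), '->', repr(fix_quotes(sample)),
          '->', repr(parse_escapes(sample)))
```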

score -3 · Answer 8 · answered Mar 26 '18 at 09:42

The code below should work when \n is required to be displayed as a newline in the string.

import string

our_str = 'The String is \\n, \\n and \\n!'
new_str = string.replace(our_str, '/\\n', '/\n', 1)
print(new_str)

answered Mar 26 '18 at 09:42

Vignesh Ramsubbose


4

This doesn't work as written (the forward slashes make the `replace` do nothing), uses wildly outdated APIs (the `string` module functions of this sort are deprecated as of Python 2.0, replaced by the `str` methods, and gone completely in Python 3), and only handles the specific case of replacing a single newline, not general escape processing. – ShadowRanger Feb 19 '19 at 19:50
