14

I want to convert strings containing escaped characters to their normal form, the same way Python's lexical parser does:

>>> escaped_str = 'One \\\'example\\\''
>>> print(escaped_str)
One \'example\'
>>> normal_str = normalize_str(escaped_str)
>>> print(normal_str)
One 'example'

Of course, the boring way would be to replace every known escape sequence one by one: http://docs.python.org/reference/lexical_analysis.html#string-literals

How would you implement normalize_str() in the above code?

martineau
aligf

4 Answers

26
>>> escaped_str = 'One \\\'example\\\''
>>> print escaped_str.encode('string_escape')
One \\\'example\\\'
>>> print escaped_str.decode('string_escape')
One 'example'

Several similar codecs are available, such as rot13 and hex.

The above is Python 2.x. Since you said (in a comment below) that you're using Python 3.x: decoding a Unicode string object is circumlocutious there, but still possible. The "string_escape" codec does not exist in Python 3; use "unicode_escape" instead:

Python 3.3a0 (default:b6aafb20e5f5, Jul 29 2011, 05:34:11) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> escaped_str = "One \\\'example\\\'"
>>> import codecs
>>> print(codecs.getdecoder("unicode_escape")(escaped_str)[0])
One 'example'
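
For Python 3, the same decoder can be wrapped into the normalize_str() the question asks for. This is only a minimal sketch, and it assumes the input is ASCII-only: the unicode_escape codec treats its data as Latin-1, so non-ASCII characters in the input may come out garbled. Other escapes, such as \t, are interpreted as well.

```python
import codecs

def normalize_str(escaped_str):
    """Interpret backslash escapes in an ASCII-only str (a sketch, not a full lexer)."""
    # unicode_escape understands the escape sequences of Python string literals.
    return codecs.decode(escaped_str, "unicode_escape")

print(normalize_str("One \\'example\\'"))  # One 'example'
print(repr(normalize_str("a\\tb")))         # 'a\tb', i.e. a real tab character
```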
Fred Nurk
  • 1
    One good turn deserves another :) I once found that I could elegantly solve a problem by writing my own string codec, FWIW. – Karl Knechtel Jul 29 '11 at 03:10
  • 1
    This approach doesn't seem to work in Python 3. I get: AttributeError: 'str' object has no attribute 'decode'. – aligf Jul 29 '11 at 17:22
  • 1
    in python 3, `str` is `bytes` and `unicode` is `str`. You probably need to first 'encode' into utf8 or ascii (to get the bytes) then decode from 'string_escape' – SingleNegationElimination Jul 29 '11 at 17:38
  • Does this work with \t tab? I couldn't get it to, nor did printing directly to a string buffer, or redirecting stdout to a buffer. The only thing that did it was expandtabs. – J B May 16 '22 at 19:39
6

SingleNegationElimination already mentioned this, but here is an example:

In Python 3:

>>> escaped_str = 'One \\\'example\\\''
>>> print(escaped_str.encode('ascii', 'ignore').decode('unicode_escape'))
One 'example'
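
Note that encode('ascii', 'ignore') silently drops any non-ASCII characters. Here is a sketch of a variant that keeps them, assuming the input is an ordinary str: encode to Latin-1 with the backslashreplace error handler, so that characters outside Latin-1 become \uXXXX escapes, which unicode_escape then decodes back. The name unescape_keep_unicode is made up for this example.

```python
def unescape_keep_unicode(escaped_str):
    """Interpret backslash escapes while preserving non-ASCII characters (sketch)."""
    # Characters up to U+00FF become single Latin-1 bytes; anything beyond
    # is rewritten as a \uXXXX or \UXXXXXXXX escape instead of being dropped.
    raw = escaped_str.encode("latin-1", "backslashreplace")
    # unicode_escape reads the bytes as Latin-1 and interprets every escape,
    # including the ones backslashreplace just produced.
    return raw.decode("unicode_escape")

print(unescape_keep_unicode("One \\'example\\'"))
print(unescape_keep_unicode("caf\u00e9 costs 3 \u20ac\\n"), end="")
```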
Nicolai Lissau
  • For SSIDs as obtained from `iw wlan0 scan`, this gave me encoding errors. Solved that with: `print(ssid.encode().decode('unicode_escape').encode('latin1').decode('utf-8'))` --- Thanks for setting me on the right track, Attaque! – Luc Apr 03 '21 at 02:14
6

I assume the question is really:

I have a string that is formatted as if it were part of Python source code. How can I safely interpret it, so that \n within the string becomes a newline, quotation marks are expected on either end, and so on?

Try ast.literal_eval.

>>> import ast
>>> print ast.literal_eval(raw_input())
"hi, mom.\n This is a \"weird\" string, isn't it?"
hi, mom.
 This is a "weird" string, isn't it?

For comparison, going the other way:

>>> print repr(raw_input())
"hi, mom.\n This is a \"weird\" string, isn't it?"
'"hi, mom.\\n This is a \\"weird\\" string, isn\'t it?"'
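
The sessions above are Python 2. A rough Python 3 equivalent, as a sketch: parse_string_literal is a hypothetical helper name, and it assumes the input really is a quoted string literal, possibly with surrounding whitespace.

```python
import ast

def parse_string_literal(source):
    """Evaluate a quoted Python string literal without running arbitrary code."""
    # strip() removes surrounding whitespace that is not part of the literal.
    value = ast.literal_eval(source.strip())
    if not isinstance(value, str):
        raise ValueError("expected a string literal, got %s" % type(value).__name__)
    return value

text = parse_string_literal(r'''"hi, mom.\n This is a \"weird\" string, isn't it?"''')
print(text)
# repr() goes the other way: it produces a literal that parses back to the same value.
assert parse_string_literal(repr(text)) == text
```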
Karl Knechtel
  • 3
    literal_eval requires a valid string literal, including begin/end quotes. Adding the quotes (the sample in the question doesn't have them) has several edge cases, depending on what type of input you want to accept. – Fred Nurk Jul 29 '11 at 02:29
  • 1
    @Fred very true; but I imagine that in most situations where this is really the problem you want to solve, the begin/end quotes are actually there, even if OP left them out of the example. :) – Karl Knechtel Jul 29 '11 at 02:57
  • 1
    I'm not sure that really is the problem you'd always want to solve: I'd guess the string_escape codec (as in my answer) exists to fill the real need of transforming escapes without having a string literal. (Pointing out literal_eval is still useful though; I'm the upvote. ;) – Fred Nurk Jul 29 '11 at 03:00
  • this fails with prefixed/suffixed space/tab characters since that makes it invalid in python – ccpizza May 05 '22 at 11:49
  • Well, yes; those weren't part of the specification. "a string that is formatted as if it were a part of Python source code", inherently, begins and ends with a matching pair of either `'` or `"`, not whitespace. If you want to handle that, though, it's trivial to `.strip()` off first. – Karl Knechtel May 05 '22 at 12:36
0

Unpaired backslashes are just artifacts of the representation and not actually stored internally. You could cause errors if trying to do this manually.

If your only interest is removing each backslash that is not itself escaped (that is, not preceded by an odd number of backslashes), you could try a while loop:

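escaped_str = 'One \\\'example\\\''
chars = []
i = 0
while i < len(escaped_str):
    # A backslash escapes the next character: drop the backslash, keep that character.
    # A trailing backslash with nothing after it is copied unchanged.
    if escaped_str[i] == '\\' and i + 1 < len(escaped_str):
        chars.append(escaped_str[i+1])
        i += 2
    else:
        chars.append(escaped_str[i])
        i += 1
fixed_str = ''.join(chars)
print(fixed_str)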

Examine your variables afterwards and you'll see why what you're trying to do doesn't make sense.

On a side note, I'm almost 100% certain that doing it "the same way Python's lexical parser does" does not involve a parser, so to speak. A parser is for grammars, which describe the way you fit words together.

You're thinking of lexical content verification maybe, which is often specified using regular expressions. Parsers are an altogether more challenging and powerful beast, and not something you want to mess around with for the purposes of linear string manipulation.
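
As a sketch of the regular-expression route for the narrow task above (dropping each escaping backslash and keeping the character after it): this assumes no escape needs to be turned into a special character, so \n becomes a plain n rather than a newline. The name strip_backslashes is chosen for this example.

```python
import re

def strip_backslashes(escaped_str):
    """Remove each escaping backslash, keeping the character that follows it."""
    # Match a backslash plus the character it escapes; keep only that character.
    # DOTALL lets "." also match a newline that directly follows a backslash.
    return re.sub(r"\\(.)", r"\1", escaped_str, flags=re.DOTALL)

print(strip_backslashes("One \\'example\\'"))
```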

machine yearning
  • 2
    What the OP calls a "lexical parser" might more accurately be termed a **lexer**, which Python certainly does have. Fortunately, we don't have to re-invent it; it's reflected in some detail - see my answer. – Karl Knechtel Jul 29 '11 at 02:06