
Conclusion: It is not possible to override or disable Python's built-in escape sequence processing so that callers can skip the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone designs objects that work on complex strings (like regex) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!




Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain, among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings and .encode('string-escape') (and its decode counterpart), but I can't find the right approach.

Given an escaped hexadecimal representation of the documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, and using .encode(), this small script (called x.py):

#!/usr/bin/env python

class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"

    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#

x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar


Will yield the following output when executed:

$ ./x.py
 \x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4


Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.


This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:

x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")


However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.



Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...


Edit2: Another example, given the sample regex below:

"^.{0}\xcb\x00\x71[\x00-\xff]"


If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:

"^.{0}\xcb\x00q[\x00-\xff]"


How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?

Kumba
  • I don't understand. If you want it as hex, then why not just print it out as hex? – Ignacio Vazquez-Abrams Dec 30 '12 at 01:53
  • I am double-checking a few things, but I basically want what gets inputted to be printed back out without any changes. I don't think this is an issue with reading from a file, so I need to go back and edit my question. However, programmatic input might still get mangled. – Kumba Dec 30 '12 at 01:55
  • Oh, and I don't want to convert to hex, I just want the escaped sequences left intact. Think if regex was supplied, I don't want Python to convert things to printable ASCII equivalents. So `\x30` should not get converted to `0` when it's printed back out, it should stay as `\x30`. Sorry if I am not clear on that. – Kumba Dec 30 '12 at 01:57
  • The original representation is lost, since many hex codes map **directly** to characters in a given encoding, therefore there is no such thing as "without any changes" since nothing has changed. – Ignacio Vazquez-Abrams Dec 30 '12 at 01:58
  • True, but if you initialized `x` as a raw string, it won't convert anything. Trying to do the same for something passed in a function w/o the user needing to do anything special. I kinda wish there was a `raw()` function that behaved like `r""`, except you passed it a variable to convert to a raw string. – Kumba Dec 30 '12 at 02:01

2 Answers


I think you have an understandable confusion about the difference between Python string literals (the source code representation), Python string objects in memory, and how those objects can be printed (the format in which they are represented in the output).

If you read some bytes from a file into a bytestring you can write them back as is.

r"" exists only in source code; there is no such thing at runtime. That is, r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
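A minimal sketch of that point, written so that it should run unchanged on Python 2.7 and Python 3. The variable names are illustrative only:

```python
# The r prefix only changes how the source text is parsed; the resulting
# objects are ordinary strings with no memory of how they were written.
raw = r"\x71"        # four characters: backslash, x, 7, 1
escaped = "\\x71"    # the same four characters, written with a doubled backslash
cooked = "\x71"      # one character: the parser already resolved the escape

print(raw == escaped)   # True
print(len(raw))         # 4
print(len(cooked))      # 1
print(cooked == "q")    # True
```

Because `cooked` and `"q"` are the same value, nothing that runs later (an `__init__`, an `.encode()` call) can tell which spelling appeared in the source.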

To see that input is not corrupted, you could print each byte as an integer:

print " ".join(str(ord(c)) for c in raw_input("input something"))

Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):

print raw_input("input something")

Identity function:

def identity(obj):
    return obj

If you do nothing to the string then your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent the input string as Python literals. If you find it confusing to work with binary strings such as "\x20\x01" then you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
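As a rough sketch of the hex-representation idea, assuming the address from the question is supplied as plain hex text (the variable names are made up for illustration):

```python
import binascii

# Hex text is plain ASCII, so no escape processing applies to it.
hex_text = "20010db885a3000000008a2e03707334"

# unhexlify turns the hex text into the 16 raw bytes of the address.
packed = binascii.unhexlify(hex_text)

# hexlify goes the other way; it returns a byte string, so decode it
# to get text back on Python 3 (harmless on Python 2).
round_trip = binascii.hexlify(packed).decode("ascii")

print(len(packed))              # 16
print(round_trip == hex_text)   # True
```

With this format the user never types a backslash, so the raw prefix question does not come up at all.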


The regex case is more complex because there are two languages:

  1. Escape sequences are interpreted by Python according to its string literal syntax
  2. The regex engine interprets the string object as a regex pattern that also has its own escape sequences
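
The two levels can be seen with a shortened version of the regex from the question. This is only an illustrative sketch; it relies on the `re` module understanding `\xhh` escapes on its own:

```python
import re

# Level 1: the Python parser resolves "\x71" to the single character q
# before the regex engine ever sees the pattern.
cooked = re.compile("^.{0}\x71")

# Level 2: with a raw literal the backslash survives, and the regex engine
# resolves \x71 itself using its own escape rules.
raw = re.compile(r"^.{0}\x71")

print(cooked.pattern == raw.pattern)   # False: the pattern strings differ
print(cooked.match("q") is not None)   # True
print(raw.match("q") is not None)      # True
```

For `\xhh` escapes both spellings end up matching the same text. That does not hold for every escape: `"\b"` is a backspace character to Python but `r"\b"` is a word boundary to the regex engine, which is why the raw prefix is still the safe default for patterns.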
jfs
  • Yeah, I think I am. It's annoying a bit -- with this framework I hope to eventually complete, in some of my classes, I'd like to disable the escape sequence processing completely so the user, whether in the interpreter or writing their own Python script via the framework, wouldn't have to worry about the whole raw string bit at all. If they used an object that disables escape sequence processing and passed it a string containing escape sequences, they should get the exact same string right back out. – Kumba Dec 30 '12 at 05:20
  • Unfortunately, this does not appear to be possible. I am digging through PyPy's source on the hopes of finding some obscure "switch" or something, but I think in the end, I will have to just stress in the documentation for such classes that ALL input strings HAVE TO be prefixed with `r` to avoid errors. – Kumba Dec 30 '12 at 05:22
  • To add, if it were possible to escape backslashes in a string (either with `encode` or `replace`) **before** Python escape sequence processing did its thing, I think that would also solve the problem I've created. That way, `\x71` gets converted to `\\x71` and doesn't get converted into `q` by the escape sequence processing. Unfortunately, I don't think this is possible either. Sound about right? – Kumba Dec 30 '12 at 05:25
  • @Kumba: I've added a note about regexes – jfs Dec 30 '12 at 06:15
  • Sebastian: I dug through Python-2.7.3's source and located the `parsestr()` function that handles the string literal prefixes. Doesn't look like there is a good way to override them, however. I think I am just going to have to stress in the class documentation that `r` is required for this specific object. It's a pity Python wasn't simply like Bash, where single quotes inhibit all escape sequence processing entirely while double quotes don't. – Kumba Dec 30 '12 at 06:24
  • Okay, you get the points. Doesn't technically address my question, but it's been open long enough to not bother deleting it, and maybe something will make sense of this mess down the road. I was able to play with my `__repr__()` function to emit out the preferred format for the object in question, including the `r` prefix. It's still not pretty, since that object eats regex, but it'll have to do. Thanks! – Kumba Dec 30 '12 at 06:47
  • @Kumba: one more attempt: 1. do nothing with the input (your code works with string objects, forget about literals) 2. what `print` shows is just some representation of the object, you can customize it: `__str__`, `__repr__` and if possible then define `__repr__` so that: `eval(repr(obj)) == obj` and the representation is unambiguous for a human (just combining the class name with `repr(tuple(init_args))` might be enough). – jfs Dec 30 '12 at 07:24
  • As far as I can tell, there is no way to know which sequences Python escaped and then revert them back when printing output. Basically, my hope WAS that, if in a class (or in a module), if you set some kind of property, then `len("\x71") == len(r"\x71")`. No such property exists, so the above evaluation will be false because the raw string will have four characters in it while the non-raw string will have 1. Even if you added `.encode('string-escape')`, the escape processing happens when the string is stored to memory, well before `.encode` is run. – Kumba Jan 01 '13 at 03:02

I think you will have to go the join route.

Here's an example:

>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34

I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
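
The same join idea can also be written without a lookup table. This is a hedged sketch rather than part of the original answer: `to_hex_escapes` is a hypothetical helper name, and the `bytearray` trick is used so the code should run on both Python 2.7 and Python 3:

```python
def to_hex_escapes(data):
    """Render every byte of data as a \\xNN escape (hypothetical helper)."""
    # bytearray yields integers on both Python 2.7 and Python 3.
    return "".join("\\x{0:02x}".format(b) for b in bytearray(data))


packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
print(to_hex_escapes(packed))
```

Note that this escapes every byte, including ones the user originally typed as plain characters, so it produces a uniform format rather than recovering the exact source spelling.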

Thomas Orozco
  • I don't think I was as clear as I initially intended. Trying to rethink my wording and find a better example. – Kumba Dec 30 '12 at 02:01
  • Reload. I think I got across what I want to do now. Thanks for being patient! – Kumba Dec 30 '12 at 02:17