3

I have code like this:

def escape_query(query):
    special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                     '^','"','~','*','?',':']
    for character in special_chars:
        query = query.replace(character, '\\%s' % character)
    return query

This function should escape every occurrence of each substring in special_chars (note the two-character entries && and ||) with a backslash.

I think my approach is pretty ugly, and I can't stop wondering whether there is a better way to do this. Answers should be limited to the standard library.

nagisa
  • 720
  • 5
  • 11
  • note that this will escape twice those already escaped characters , which might not be what you wanted – wim Aug 12 '11 at 01:19
  • \\ equals to \ and \\\\ equals to \\. Just try printing them. – nagisa Aug 12 '11 at 01:24
  • yes...i prefer to use r'\' and r'\\' for such cases as those, for the sake of clarity – wim Aug 12 '11 at 01:28
  • and in my original comment , i meant that `print escape_query('foo()')` will give `foo\(\)`, as expected, but `print escape_query('foo\(\)')` will give `foo\\\(\\\)` - perhaps not expected? – wim Aug 12 '11 at 01:30
  • It's expected. Actually I'm making calls to some API(search) and this function only escapes search query, so to make it search what I need (not `foo` instead of `foo\(\)`), I escape all special characters. – nagisa Aug 12 '11 at 01:36
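The double escaping wim describes can be checked directly. This is a minimal sketch that uses the function from the question unchanged; the sample inputs are the ones quoted in the comments, and the names once and twice are only for illustration.

```python
def escape_query(query):
    special_chars = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
                     '[', ']', '^', '"', '~', '*', '?', ':']
    for character in special_chars:
        query = query.replace(character, '\\%s' % character)
    return query

once = escape_query('foo()')   # escape a raw query
twice = escape_query(once)     # escape the already-escaped result again

print(once)   # foo\(\)
print(twice)  # foo\\\(\\\)
```

Because the backslash is first in the list, existing backslashes are escaped before any new ones are added, so each pass escapes its input exactly once.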

4 Answers

2

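Using reduce (on Python 3 it must be imported from functools; the import also works on Python 2.6+):

from functools import reduce

def escape_query(query):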
  special_chars =  ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                     '^','"','~','*','?',':']
  return reduce(lambda q, c: q.replace(c, '\\%s' % c), special_chars, query)
Ismail Badawi
  • 36,054
  • 7
  • 85
  • 97
2

The following code works on exactly the same principle as steveha's.
But I think it meets your requirement of clarity and maintainability, since the special chars are still listed in the same list as yours.

import re

special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']',
                 '^','"','~','*','?',':']

escaped_special_chars = map(re.escape, special_chars)

special_chars_pattern = '|'.join(escaped_special_chars).join('()')

def escape_query(query, reg = re.compile(special_chars_pattern) ):
    return reg.sub(r'\\\1',query)

With this code:
When the def statement is executed, the default value re.compile(special_chars_pattern) is evaluated once, and the resulting regex object becomes the default for the parameter reg.
This happens only once, when the function definition is executed (normally when the module is loaded), not each time the function is called.
On later calls the pattern is neither recompiled nor reassigned: the regex object already exists and is kept in the function's func_defaults tuple (named __defaults__ in Python 3).
This helps when the function is called many times, because reg is a local name: Python does not have to look the regex up as a global, and the caller does not have to pass it in on every call.
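The once-only evaluation of a default argument can be observed with a small sketch. The helper build_pattern and the list compile_calls are hypothetical names added only to count how often the pattern is built, and the pattern is shortened to two characters to keep the example small.

```python
import re

compile_calls = []  # hypothetical counter, used only for this demonstration

def build_pattern():
    # Record every time the pattern is actually compiled.
    compile_calls.append(1)
    return re.compile(r'([+\-])')

# build_pattern() runs here, once, when the def statement is executed.
def escape_query(query, reg=build_pattern()):
    return reg.sub(r'\\\1', query)

first = escape_query('a+b')
second = escape_query('c-d')

print(len(compile_calls))                 # 1
default_reg = escape_query.__defaults__[0]  # func_defaults on Python 2
print(default_reg.pattern)
```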

eyquem
  • 26,771
  • 7
  • 38
  • 46
  • Oh, this one definitely looks better for a lot of escapes, but it's a bit slower for a small number of queries (1000), and I don't do very many of them. It also requires importing the re module. I used Linux's `time` for comparison. – nagisa Aug 12 '11 at 10:54
  • 1
    @nagisa I'm completely astounded by the slowness of my solution. I tested and you're right: 120 milliseconds vs 6 milliseconds for 100 turns of iteration on a text 313 long having 91 special characters among them. I conclude that regexes are slowed down by the presence of numerous '|' in the patterns. So, keep your code ! – eyquem Aug 12 '11 at 13:22
  • Thanks for actually measuring the time. It's interesting to know that Python regexps slow down with lots of '|' in the pattern. If you still have the code you used to time it, would you try the pattern I wrote that uses one character class and then two '|' to handle the "&&" and "||" cases? – steveha Aug 12 '11 at 23:45
  • @steveha I wrote again the code. The tested string isn't the same but has still 313 chars among them 91 special ones => 0.5 ms for nagisa's code, 130.5 ms for my code, 132 ms for your code. I don't understand why time with nagisa's code is so strongly different. The influence of '|' on speed is only hypothetical for me for the moment. – eyquem Aug 13 '11 at 01:16
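For anyone who wants to repeat the measurement, here is a rough sketch using timeit from the standard library. The sample string and the function names escape_replace and escape_regex are assumptions, not the code eyquem actually timed, and the results depend on the machine and the Python version.

```python
import re
import timeit

SPECIAL_CHARS = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
                 '[', ']', '^', '"', '~', '*', '?', ':']

def escape_replace(query):
    # The approach from the question: one str.replace per special string.
    for s in SPECIAL_CHARS:
        query = query.replace(s, '\\' + s)
    return query

_PAT = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')

def escape_regex(query):
    # The precompiled-regex approach from the answers.
    return _PAT.sub(r'\\\1', query)

# Hypothetical test input; replace it with representative queries.
sample = 'foo(bar) && baz:[1 TO 2] || "x"* ' * 10

# Both approaches should agree before their speed is compared.
assert escape_replace(sample) == escape_regex(sample)

for func in (escape_replace, escape_regex):
    seconds = timeit.timeit(lambda: func(sample), number=1000)
    print('%s: %.4f s for 1000 calls' % (func.__name__, seconds))
```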
1

If I understand your requirements correctly, some of the special "chars" are two-character strings (specifically: "&&" and "||"). The best way to handle such an odd collection is with a regular expression. You can use a character class to match anything that is one character long, then use vertical bars to separate alternative patterns, which can be multi-character. The trickiest part is the backslash-escaping of chars; for example, to match "||" you need to write r'\|\|' because the vertical bar is special in a regular expression. In a character class, backslash is special and so are '-' and ']'. The code:

import re
_s_pat = r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)'
_pat = re.compile(_s_pat)

def escape_query(query):
    return re.sub(_pat, r'\\\1', query)

I suspect the above is the fastest solution to your problem possible in Python, because it pushes the work down to the regular expression machinery, which is written in C.

If you don't like the regular expression, you can make it easier to look at by using the verbose format, and compile using the re.VERBOSE flag. Then you can sprawl the regular expression across multiple lines, and put comments after any parts you find confusing.
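As a sketch of the verbose form, the same pattern can be written like this. Whitespace outside the character class is ignored under re.VERBOSE and # starts a comment there; the layout and comments are only one possible arrangement.

```python
import re

_pat = re.compile(r'''
    (
        [\\+\-!(){}[\]^"~*?:]   # any one of the single special characters
      | &&                      # the two-character token &&
      | \|\|                    # the two-character token ||, bars escaped
    )
''', re.VERBOSE)

def escape_query(query):
    # Prefix whatever matched (group 1) with a backslash.
    return _pat.sub(r'\\\1', query)

print(escape_query('a && b || c'))
```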

Or, you can build your list of special characters, just like you already did, and run it through this function which will automatically compile a regular expression pattern that matches any alternative in the list. I made sure it will match nothing if the list is empty.

import re
def make_pattern(lst_alternatives):
    if lst_alternatives:
        temp = '|'.join(re.escape(s) for s in lst_alternatives)
        s_pat = '(' + temp + ')'
    else:
        s_pat = '((?!))' # never matches anything ('$^' would match an empty string); group 1 still exists
    return re.compile(s_pat)
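A possible way to wire such a helper into escape_query is sketched below. Sorting the alternatives longest first is an addition of this sketch, not part of the function above: Python tries alternatives from left to right, so a hypothetical single '&' listed before '&&' would otherwise win.

```python
import re

def make_pattern(alternatives):
    if not alternatives:
        # Never matches, but still defines group 1 for the \1 replacement.
        return re.compile(r'((?!))')
    # Longest first, so '&&' is tried before a hypothetical single '&'.
    ordered = sorted(alternatives, key=len, reverse=True)
    return re.compile('(' + '|'.join(re.escape(s) for s in ordered) + ')')

special_chars = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
                 '[', ']', '^', '"', '~', '*', '?', ':']

_pat = make_pattern(special_chars)

def escape_query(query):
    return _pat.sub(r'\\\1', query)

print(escape_query('a && b:c'))
```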

By the way, I recommend you put the string and the pre-compiled pattern outside the function, as I showed above. In your code, Python will run code on each function invocation to build the list and bind it to the name special_chars.

If you want to not put anything but the function into the namespace, here's a way to do it without any run-time overhead:

import re
def escape_query(query):
    return re.sub(escape_query.pat, r'\\\1', query)

escape_query.pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')

The above uses the function's name to look up the attribute, which won't work if you rebind the function's name later. There is a discussion of this and a good solution here: how can python function access its own attributes?
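The rebinding problem can be reproduced with a short sketch. The names alias, before and failed are hypothetical and exist only for the demonstration.

```python
import re

def escape_query(query):
    # Looks up the global name escape_query at call time.
    return re.sub(escape_query.pat, r'\\\1', query)

escape_query.pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')

alias = escape_query      # second reference to the same function object
before = alias('a+b')     # works: the global name still points at the function

escape_query = None       # rebind the original name

try:
    alias('a+b')          # the body now finds None instead of the function
    failed = False
except AttributeError:
    failed = True

print(before, failed)
```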

(Note: The above paragraph replaces some stuff including a question that was discussed in the discussion comments below.)

Actually, upon further thought, I think this is cleaner and more Pythonic:

import re

_pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')

def escape_query(query, pat=_pat):
    return re.sub(pat, r'\\\1', query)

del _pat # not required but you can do it; del is a statement, so no parentheses are needed

When the def statement for escape_query() is executed, the object bound to the name _pat becomes the default value of the parameter pat, so it is available under that name inside the function. You can then use del to unbind the name _pat if you like. This nicely encapsulates the pattern inside the function, does not depend at all on the function's name, and allows you to pass in an alternate pattern if you wish.
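A short usage sketch of the default-argument version, showing that the function keeps working after the module-level name is deleted and that a different pattern can be passed in. The pattern only_parens is a hypothetical example.

```python
import re

_pat = re.compile(r'([\\+\-!(){}[\]^"~*?:]|&&|\|\|)')

def escape_query(query, pat=_pat):
    return pat.sub(r'\\\1', query)

del _pat  # the function still holds its own reference as the default

# Hypothetical alternate pattern that escapes parentheses only.
only_parens = re.compile(r'([()])')

default_result = escape_query('f(x)+1')
custom_result = escape_query('f(x)+1', pat=only_parens)

print(default_result)  # f\(x\)\+1
print(custom_result)   # f\(x\)+1
```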

P.S. If your special characters were always a single character long, I would use the code below:

_special = set(['[', ']', '\\', '+']) # add other characters as desired, but only single chars

def escape_query(query):
    return ''.join('\\' + ch if (ch in _special) else ch  for ch in query)
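For the single-character case, a different technique from the join above is str.translate with a table built by str.maketrans. This sketch assumes Python 3, where maketrans accepts a dict that maps a character to a replacement string; the table name _table is hypothetical.

```python
# Single characters only; '&&' and '||' cannot be expressed this way.
_special = set('[]\\+-!(){}^"~*?:')

# Map each special character to itself with a backslash in front.
_table = str.maketrans({ch: '\\' + ch for ch in _special})

def escape_query(query):
    return query.translate(_table)

print(escape_query('a+b[0]'))  # a\+b\[0\]
```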
Community
  • 1
  • 1
steveha
  • 74,789
  • 21
  • 92
  • 117
  • I'm not really looking for the best performance, as everything is done in the background. Also, this regex looks pretty complicated and hard to maintain. – nagisa Aug 12 '11 at 01:55
  • I don't think it is much worse than your existing code; that list of special characters isn't pretty either. But I'll add a note about making the regexp easier. – steveha Aug 12 '11 at 02:03
  • @steveha _"if anyone knows a way to find escape_query.func_dict from inside the function body, please let me know."_ I don't see where's the problem. If you put the instruction ``print escape_query.func_dict`` inside the function block, you'll obtain the display of this dict. – eyquem Aug 12 '11 at 08:50
  • @eyquem, the problem I asked about is a way to find the dict without using the name `escape_query`. As I noted in the answer text, it is possible to rebind the function to a different name, and then that code would fail. In practice it isn't a problem, but I'm wondering if there is any possible way in Python for a function object's code to find the function object's `func_dict` without needing to know the current name the function object is bound to. Put another way, can you write a lambda function that can reference its `func_dict`? – steveha Aug 12 '11 at 21:40
  • @eyquem, I searched StackOverflow and found a question and answer that discusses the issue. I'll edit my answer to link this also. http://stackoverflow.com/questions/3109289/how-can-python-function-access-its-own-attributes – steveha Aug 12 '11 at 23:32
  • @steveha The answer to this question doesn't interest me a lot because I wonder when it could be practically useful. But it is an interesting question on theoretical level. I think it would be instructive to precisely understand what makes that this seems impossible to do, while it isn't clear to me why. But I have no idea. – eyquem Aug 13 '11 at 00:04
  • @steveha I hadn't seen your most recent comment when I posted the last one from me. Interesting link, there are some solutions proposed, wow ! I'll study these tomorrow – eyquem Aug 13 '11 at 01:21
0

Not sure if this is any better, but it works and is probably faster.

def escape_query(query):
    special_chars = ['\\','+','-','&&','||','!','(',')','{','}','[',']', '^','"','~','*','?',':']
    query = "".join(map(lambda x: "\\%s" % x if x in special_chars else x, query))
    for sc in filter(lambda x: len(x) > 1, special_chars):
        query = query.replace(sc, "\\%s" % sc)
    return query
Rumple Stiltskin
  • 9,597
  • 1
  • 20
  • 25