How to find superfluous escapes in regex patterns

Question

How can I find and remove all the unneeded backslash escapes in Python regular expressions?

For example, in r'\{\"*' all the escapes are unnecessary, and the pattern has the same meaning as r'{"*'. But in r'\[a-b]\{2}\Z\'\+' removing any of the escapes would change how the regex engine interprets the pattern (or cause a syntax error).

Given the pattern, is there an easy way to remove superfluous escapes programmatically in Python, i.e. something other than parsing the whole regex string and looking for escapes on non-special characters?

`r'\a'` does *not* match the same thing as `r'a'` in Python. — Ry-, Dec 29 '17 at 05:28
You could try using the internal `sre_parse` module – `list(sre_parse.parse(r"{")) == list(sre_parse.parse(r"\{"))`. — Ry-, Dec 29 '17 at 05:47
@Ryan, I was thinking about `re.DEBUG` flag, but working with the `sre_parse.parse` is indeed easier. I didn't even know it was there. Thanks! — AXO, Dec 29 '17 at 05:58
I don't understand people who are saying this is homework question and leave a downvote. First, how do you know that? And what if it is a homework question, [is there any policy against homework question?](https://meta.stackoverflow.com/questions/334822/how-do-i-ask-and-answer-homework-questions) Someone asks [How do I download a file over HTTP using Python?](https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python) and gets hundreds of upvotes, I ask "How to find superfluous escapes" and all the downvotes. — AXO, Dec 29 '17 at 14:01

AXO · Accepted Answer · 2017-12-29T10:40:55.457

Here is the code that I came up with:

from contextlib import redirect_stdout
from io import StringIO

from re import compile, DEBUG, error, MULTILINE, VERBOSE


def unescape(pattern: str, flags: int = 0):
    """Remove any escape that does not change the regex meaning"""
    strio = StringIO()
    with redirect_stdout(strio):
        compile(pattern, DEBUG | flags)
        original_debug = strio.getvalue()
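    index = len(pattern)
    # Scan right to left so a removal never shifts the indices still to be visited.
    # Stop at 0: decrementing past it would wrap around to pattern[-1].
    while index > 0:
        index -= 1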
        character = pattern[index]
        if character != '\\':
            continue
        removed_escape = pattern[:index] + pattern[index+1:]
        strio = StringIO()
        with redirect_stdout(strio):
            try:
                compile(removed_escape, DEBUG | flags)
            except error:
                continue
        if original_debug == strio.getvalue():
            pattern = removed_escape
    return pattern

def print_unescaped_raw(regex: str, flags: int = 0):
    """Print an unescaped raw-string representation of regex."""
    print(
        ("r'%s'" % unescape(regex, flags)
        .replace("'", r'\'')
        .replace('\n', r'\n'))
    )

print_unescaped_raw(r'\{\"*')  # r'{"*'

One can also use sre_parse.parse directly, but the SubPattern objects and tuples in the result may contain nested SubPattern objects, and SubPattern does not define an __eq__ method, so a recursive comparison routine might be required.
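A minimal sketch of such a recursive comparison follows. The helper names normalize and same_parse are hypothetical, and the code relies on the parser's internal data layout (SubPattern.data holding (op, argument) tuples), which is an undocumented implementation detail and may change between Python versions. On Python 3.11+ the parser lives in the private re._parser module, so the import falls back accordingly.

```python
try:
    from re import _parser as sre_parse  # Python 3.11+ (private module)
except ImportError:
    import sre_parse  # older Python versions


def normalize(node):
    """Recursively turn a parse tree into plain nested lists.

    SubPattern has no __eq__, so we unwrap it into its .data list and
    walk into every tuple/list that might hide another SubPattern.
    """
    if isinstance(node, sre_parse.SubPattern):
        return [normalize(item) for item in node.data]
    if isinstance(node, (list, tuple)):
        return [normalize(item) for item in node]
    return node  # opcodes, ints, None


def same_parse(first: str, second: str, flags: int = 0) -> bool:
    """Return True if both patterns parse to the same tree.

    Raises re.error if either pattern is invalid.
    """
    return (normalize(sre_parse.parse(first, flags))
            == normalize(sre_parse.parse(second, flags)))


print(same_parse(r'\{', r'{'))        # True
print(same_parse(r'(\{)+', r'({)+'))  # True, nested SubPatterns are compared too
print(same_parse(r'\a', r'a'))        # False, \a is the bell character
```

With a helper like this, the loop in unescape could compare parse trees instead of captured debug output, at the cost of depending on private internals.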

P.S. Unfortunately, this method does not work with the regex module because in regex you get different debug output for escaped characters:

regex.compile(r'{', regex.DEBUG)
LITERAL MATCH '{'

regex.compile(r'\{', regex.DEBUG)
CHARACTER MATCH '{'

By contrast, re prints the same output for both:

re.compile(r'{', re.DEBUG)
LITERAL 123

re.compile(r'\{', re.DEBUG)
LITERAL 123

score 0 · Answer 2 · answered Dec 29 '17 at 05:30

I will not do the whole implementation but I can give you some hints to make a viable heuristic/algo:

0. Initial hypothesis: for each regex you are going to modify, you have a list of input strings and expected outputs that validate its behavior.
1. Use the list at http://www.rexegg.com/regex-quickstart.html to find the characters that must stay escaped with the backslash \, and build a list of elements that should not be replaced.
2. Parse your regex and replace every \X with X, where X is a character that is not in the list built in step 1.
3. Run your initial regex and your new regex on the same input strings and compare their outputs for every input.
4. If all the results are the same, you can use your new, simplified regex.
5. If at least one output differs, throw away the new regex and proceed with local replacements. Select one of the \X in your initial regex that is not in the list built in step 1 (at random, or round robin) and replace it with X. Compare the output with the initial regex's output for each input string. If they match, keep that regex and repeat step 5 until no further progress is possible. If the output differs for that replacement, remove it from the list of candidate replacements and repeat step 5 with your previous regex. Once the list of possible local replacements is empty, you can use the new regex instead of the old one.
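The local-replacement part of these steps can be sketched roughly as follows. This is only an illustration under assumptions: the names behaviour and unescape_by_testing are hypothetical, candidates are visited right to left rather than randomly, and "output" is taken to mean the spans and groups of all matches. The result is only as trustworthy as the sample strings: an escape such as \d could be wrongly removed if no sample contains a digit or the letter d.

```python
import re


def behaviour(pattern: str, samples, flags: int = 0):
    """Collect the match spans and groups a pattern produces on each sample."""
    compiled = re.compile(pattern, flags)
    return [[(m.span(), m.groups()) for m in compiled.finditer(s)]
            for s in samples]


def unescape_by_testing(pattern: str, samples, flags: int = 0) -> str:
    """Drop each backslash whose removal leaves the sample outputs unchanged."""
    expected = behaviour(pattern, samples, flags)
    index = len(pattern)
    while index > 0:
        index -= 1
        if pattern[index] != '\\':
            continue
        candidate = pattern[:index] + pattern[index + 1:]
        try:
            if behaviour(candidate, samples, flags) == expected:
                pattern = candidate  # this escape was not needed for the samples
        except re.error:
            pass  # removal made the pattern invalid, so keep the escape
    return pattern


print(unescape_by_testing(r'\{\"*', ['{"""', 'abc', '{']))
print(unescape_by_testing(r'\[a-b]', ['[a-b]', 'a', 'b']))
```

Trying one escape at a time skips the "replace everything, then fall back" shortcut of steps 2 to 4, which keeps the sketch short at the price of more regex evaluations.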
