I have a list of regexes from which I want to extract those that are equivalent to a string comparison.

For example, these regexes are equivalent to a simple string comparison:

[r"example",   # No metacharacters
 r"foo\.bar"]  # . is not a metacharacter because it is escaped

while these regexes are not:

[r"e.ample",   # . is a metacharacter
 r"foo\\.bar"] # . is a metacharacter because it is not escaped

According to https://docs.python.org/2/howto/regex.html, the list of valid metacharacters is . ^ $ * + ? { } [ ] \ | ( ).

I could build a regex to detect this, but it looks a bit complicated. Is there a shortcut, such as examining the compiled re object?

Vincent Savard
samwyse
  • No shortcut to learning how to write a regex. I find using https://regex101.com/ useful for checking the work I'm doing. – AlG Feb 23 '16 at 14:57
  • @AlG: He doesn't want to write a regex. He essentially wants to find out whether a string contains any non-escaped regex metacharacters; if it doesn't, the regex is unnecessary because a simple equality check could be used. – Vincent Savard Feb 23 '16 at 14:58
  • @VincentSavard it looks like some backslashes may have been lost somewhere. I want to keep backslash-dot (equivalent to a comparison to the literal dot), but discard backslash-backslash-dot (a literal backslash followed by any character). – samwyse Feb 23 '16 at 15:05
  • Searching for the regex 'example' is the same as str.find("example"). Searching for 'e.ample' cannot be replaced with a simple find. – samwyse Feb 23 '16 at 15:14
  • You can check the output of re.DEBUG to see if it only contains literals -> http://stackoverflow.com/questions/606350/how-can-i-debug-a-regular-expression-in-python – Keith Hall Feb 23 '16 at 15:28
  • @samwyse: I edited your question a bit because some people are voting to close it. If you think I changed the meaning of your question, feel free to rollback to the previous revision or edit it further. – Vincent Savard Feb 23 '16 at 15:32
  • Is the list of regexes targeting file names? If so, keep in mind that strings containing `\.`, `\$`, `\+`, `\{`, `\}`, `\[`, `\]`, `\(`, `\)` are all valid file names. – Quinn Feb 23 '16 at 15:33
  • Is this just an academic exercise, or are you attempting to perform some sort of optimization? – Bryan Oakley Feb 23 '16 at 15:36
  • I have some WSGI apps each with a long list of patterns to match URLs against. I'd like to programmatically find the ones for fixed pages, i.e. 'cust' but not 'cust/(.*)'. – samwyse Feb 23 '16 at 15:40

2 Answers

Inspired by Keith Hall's comment, here's a solution based on an undocumented feature of Python's regex compiler:

import contextlib, io, re

def contains_meta(regex):
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):  # capture stdout; restored even if compile raises
        re.compile(regex, re.DEBUG)           # compile the regex for the debug tree side effect
    # Newer Python 3 releases also print a bytecode listing after a blank line,
    # so keep only the parse tree that comes before it
    tree = buffer.getvalue().split("\n\n", 1)[0]
    return not all(line.startswith("LITERAL ") for line in tree.strip().split("\n"))

Output:

In [9]: contains_meta(r"example")
Out[9]: False

In [10]: contains_meta(r"ex.mple")
Out[10]: True

In [11]: contains_meta(r"ex\.mple")
Out[11]: False

In [12]: contains_meta(r"ex\\.mple")
Out[12]: True

In [13]: contains_meta(r"ex[.]mple")  # single-character charclass --> literal
Out[13]: False

In [14]: contains_meta(r"ex[a-z]mple")
Out[14]: True

In [15]: contains_meta(r"ex[.,]mple")
Out[15]: True
Tim Pietzcker
  • Apparently, Python 2 uses `literal` whereas Python 3 uses `LITERAL` as a prefix, so you'll need to account for that if you're using Python 2. – Tim Pietzcker Feb 23 '16 at 16:11
  • That fifth example (`ex[.]mple`) may be a dealbreaker if the OP wants to be able to treat the pattern as a literal string. It might be worth pointing out that this doesn't literally answer the question "does this have metacharacters", but only "does this pattern reduce down to a static string". Clearly `e[.]ample` has metacharacters and cannot be used as a literal string. I'm not suggesting that this answer is wrong, just that there may be some hidden gotchas depending on how the OP plans to use the function. – Bryan Oakley Feb 23 '16 at 16:33
  • For OP's scenario, `[r'\{',r'\}',r'\^',r'\\',r'\|']` are all False in meta check, but they are not valid characters in a URL (see: http://stackoverflow.com/questions/7109143/what-characters-are-valid-in-a-url). – Quinn Feb 23 '16 at 16:35
  • @BryanOakley: I do think the question *is* "does it reduce to a static string?" - otherwise `ex\.mple` would have the same problem as `ex[.]mple`. OP is explicitly asking whether any *unescaped* metacharacters are in the string, so I guess that's what he wants - his comments indicate he's looking for static strings. – Tim Pietzcker Feb 23 '16 at 16:40
  • I'm giving both answers upvotes, but this one gets accepted. I did make one small change to *contains_meta* however, to return the literal string or throw a TypeError: return ''.join(line.startswith('literal ') and chr(int(line[8:])) for line in output.strip().split("\n")) – samwyse Mar 06 '16 at 15:38
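
The idea in the last comment (return the literal string instead of a boolean) can also be sketched without capturing stdout, by asking the parser for the tree directly. This is only a sketch under assumptions: it relies on a private, undocumented module (`re._parser` on Python 3.11+, `sre_parse` before that), and `literal_or_none` is a hypothetical name, not something from the answer above.

```python
try:
    from re import _parser as sre_parse   # Python 3.11+ (private module, may change)
except ImportError:
    import sre_parse                       # older Python 3 releases


def literal_or_none(pattern):
    """Return the fixed string a pattern matches, or None if it is not a fixed string."""
    parsed = sre_parse.parse(pattern)      # raises re.error for an invalid pattern
    chars = []
    for op, arg in parsed:                 # each top-level node is an (opcode, argument) pair
        if op != sre_parse.LITERAL:
            return None                    # any non-literal node means a "real" regex
        chars.append(chr(arg))             # for LITERAL nodes the argument is a code point
    return "".join(chars)


for p in [r"cust", r"cust/(.*)", r"foo\.bar", r"foo\\.bar", r"ex[.]mple"]:
    print(p, "->", literal_or_none(p))
```

As with the re.DEBUG approach, a single-character class such as `ex[.]mple` is reported as the fixed string `ex.mple`. The sketch does not look at flags, so a pattern such as `(?i)cust` is reported as the fixed string `cust` even though it would match case-insensitively.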

Here is a regex that you can use to detect non-escaped metacharacters in Python:

>>> import re
>>> rex = re.compile(r'^([^\\]*)(\\.[^.^$*+?{}\[\]|()\\]*)*[.^$*+?{}\[\]|()]', re.MULTILINE)

>>> arr = [r"example", r"foo\.bar", r"e.ample", r"foo\\.bar", r"foo\\bar\.baz"]

>>> for s in arr:
...     print(s, rex.search(s) is not None)
...

The regex above skips over every escape sequence (a \ together with the character that follows it) and then searches for a metacharacter, i.e. one of:

. ^ $ * + ? { } [ ] | ( )

that is not preceded by an escaping \.

Output:

example False
foo\.bar False
e.ample True
foo\\.bar True
foo\\bar\.baz False
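
For comparison, roughly the same scan can be written as a plain loop, which some may find easier to follow than the regex. This is a minimal sketch, not part of the original answer; `has_unescaped_meta` and `METACHARACTERS` are hypothetical names.

```python
METACHARACTERS = set(".^$*+?{}[]|()")


def has_unescaped_meta(pattern):
    """Return True if pattern contains a metacharacter that is not escaped by a backslash."""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2              # skip the backslash and the character it escapes
            continue
        if ch in METACHARACTERS:
            return True         # found a metacharacter outside any escape
        i += 1
    return False


for s in [r"example", r"foo\.bar", r"e.ample", r"foo\\.bar", r"foo\\bar\.baz"]:
    print(s, has_unescaped_meta(s))
```

Like the regex, this treats every backslash sequence as harmless, so a pattern containing `\d` or `\w` is reported as having no metacharacters even though those are character classes rather than literal characters.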

anubhava