python complementing a complex regular expression

Question

Trying to learn regular expressions and despite some great posts on here and links to a regEx site, I have a case I was trying to hack out of sheer stubbornness that defied producing the match I was looking for. To understand it, consider the following code which allows us to pass in a list of strings and a pattern and find out if the pattern either matches all items in the list or matches none of them:

import re
def matchNone(pattern, lst):
    return not any([re.search(pattern, i) for i in lst])

def matchAll(pattern, lst):
    return all([re.search(pattern, i) for i in lst])

To help w/ debugging, this simple code allows us to just add _test to a function call and see what is being passed to the any() or all() functions which ultimately return the result:

def matchAll_test(pattern, lst):
    return [re.search(pattern, i) for i in lst]

def matchNone_test(pattern, lst):
    return ([re.search(pattern, i) for i in lst])

This pattern and list produces True from matchAll():

wordPattern = "^[cfdrp]an$"
matchAll(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True

This pattern on the surface appears to work with matchNone() in our effort to reverse the pattern:

wordPattern = "^[^cfdrp]an|[cfdrp](^an)$"
matchNone(wordPattern, ['can', 'fan', 'dan', 'ran', 'pan']) # True

It returns True as we hoped it would. But a true reversal of this pattern would return False for a list of values where none of them are equivalent to our original list ['can', 'fan', 'dan', 'ran', 'pan'] regardless of what else we pass into it. (i.e. "match anything except these 5 words")

In testing to see what changes to the words in this list will get us a False, we quickly discover the pattern is not as successful as it first appears. If it were, it would fail for matchNone() on anything not in the aforementioned list.

These permutations helped uncover the short-comings of my pattern tests:

["something unrelated", "p", "xan", "dax", "ccan", "dann", "ra"]

In my exploration of above, I tried other permutations as well taking the original list, using the _test version of the functions and changing one letter at a time on the original words, and or modifying one term or adding one term from permutations like what is above.

If anyone can find the true inverse of my original pattern, I would love to see it so I can learn from it.

To help with your investigation:

This pattern also works with matchAll() for all words, but I could not seem to create its inverse either: "^(can|fan|dan|ran|pan)$"

Thanks for any time you expend on this. I'm hoping to find a regEx guru on here who spots the mistake and can propose the right solution.

`(^an)` doesn't mean "anything but `an`". Outside of a character class, `^` matches the start of the string (and also the beginning of a line in MULTILINE mode). — user2357112, Mar 22 '17 at 19:06
Even if it *did* mean "anything but `an`", you still wouldn't have correctly inverted your regex, though. — user2357112, Mar 22 '17 at 19:07
Responding to all comments on here. Whereas the first pattern successfully matches the 5 words in the list, the "reverse" or "inverse" I was looking for (but was grasping for words) is a pattern that matches everything passed to it except the 5 words in our sample. Regarding the comments on "^", on various regEx sites and in the class I am taking, this character has two contexts: at the very start of a pattern it means start of the pattern, but inside [] and possibly inside () or placed in the middle of a pattern, it means "not this". . . — TMWP, Mar 22 '17 at 19:17
My pattern tests all get partway there, but all of them fail when I pass in some of the given test cases so they are not true "reversals" or the original and fail at goal of match all except what is in the list. As stated, there are probably other ways around this use case, but finding the regEx equivelence of what I am looking for will go a long way to helping me grasp its syntax and some of the confusing results I see with patterns I try. — TMWP, Mar 22 '17 at 19:17
In what way do your initial `matchAll` and `matchNone` functions not work? I see no reason not to just use them with the same word pattern. — TemporalWolf, Mar 22 '17 at 19:19
*and possibly inside () or placed in the middle of a pattern, it means "not this"* is **wrong**. — Wiktor Stribiżew, Mar 22 '17 at 19:20
This is complex to describe but based on comments on here, I just attempted to edit the post for clarity. — TMWP, Mar 22 '17 at 19:27
Last question on here: my pattern passes the "matches none" test, but if I pass in other values (as described in the post), it will match on some of them too. As a purely academic exercise to better understand regular expressions, I am looking for a pattern that is the equivelent of: match anything that is not the 5 words in my list. The functions and what-not are provided for context and to help test success or failure in that regard. Does that help clarify for you? — TMWP, Mar 22 '17 at 19:28
I have a solution that's fairly short (19 characters) and simple, but the gigantic question text with so much fluff and the actual question still missing annoys me so much that I don't want to post it... — Stefan Pochmann, Mar 22 '17 at 20:54

R1tschY · Accepted Answer · 2017-03-22T20:11:48.687

3

I hope I understood your question. This is the solution that I found:

^(?:[^cfdrp].*|[cfdrp][^a].*|[cfdrp]a[^n].*|.{4,}|.{0,2})$

[^cfdrp].*: if text starts not with c, f, d, r or p than match
[cfdrp][^a].*: text starts with c, f, d, r or p: match if second character is not an a
[cfdrp]a[^n].*: text starts with [cfdrp]a: match if third character is not a n.
.{4,}: match anything with more than 3 characters
.{0,2}: match anything with 0, 1, or 2 characters

It is equal to:

^(?:[^cfdrp].*|.[^a].*|..[^n].*|.{4,}|.{0,2})$

edited Mar 22 '17 at 20:11

answered Mar 22 '17 at 20:03

R1tschY

3,032
1
24
30

could you explain the "?:" ... having trouble getting even the help on this part of the syntax, but I think you may have gotten it. Doing some testing and some reverse engineering of your explanation to better understand and validate it and if it works as I expect, the answer will be accepted shortly. – TMWP Mar 22 '17 at 20:40
Your answer was accepted. If you have not done so already, if you could vote for the question it would be appreciated. My questions with answers and comments but no votes almost cost me the ability to post new questions last week because of a weird algorithm on this site. – TMWP Mar 22 '17 at 22:04
@TMWP: without the `?:` it will be the same for your use case. It generates a non-capturing group instead of a capturing group. Important is that everything between `^` and `$` is in one group. Otherwise it is something like: `(^[^cfdrp].*)|(.[^a].*)|(..[^n].*)|(.{4,})|(.{0,2}$)` – R1tschY Mar 23 '17 at 08:43
Thanks. I have been learning exactly what I hoped to from this exchange. Analyzing the patterns you submitted, I now feel like I have a better understanding of regEx so I will write more patterns that work in the future. Best wishes and have a good weekend. – TMWP Mar 23 '17 at 12:37

mbomb007 · Answer 2 · 2018-02-01T15:59:33.853

1

What you are looking to do is find the complement. To do this for any regular expression is a difficult problem. There is no built-in for complementing a regex.

There is an open challenge on PPCG to do this. One comment explains the difficulty:

It's possible, but crazy tedious. You need to parse the regexp into an NFA (eg Thompson's algorithm), convert the NFA to a DFA (powerset construction), complete the DFA, find the complement, then convert the DFA to a RE (eg Brzozowski's method). ie slightly harder than writing a complete RE engine!

There are Python libraries out there that will convert from a regular expression (the original specification refers to a "regular language", which only has literals, "or", and "star" -- much simpler than the type of regex you're thinking of [more info here]) to an NFA, to a DFA, complement it, and convert it back. It's pretty complicated.

Here's a related SO question: Finding the complement of a DFA?

In summary, it's far simpler to find the result of the original regular expression instead, then use Boolean negation.

edited Feb 01 '18 at 15:59

answered Mar 22 '17 at 20:09

mbomb007

3,788
3
39
68

This was helpful and explains why as patterns grow large, the answer should not be attempted if another way can be found. I think with 5 terms, it should be possible and am evaluating another solution on here as I type this. Thanks for the useful comments. The whole point was to understand RegEx syntax better which is why an actual pattern-match to this limited example was what was being looked for. Best wishes. – TMWP Mar 22 '17 at 20:42
Here is an [implementation in Python](https://ideone.com/eWJCKb). I put a comment at the bottom of the code explaining what operations are allowed. It's not quite the complement you're looking for, because it only includes letters that are in the input. Also, it's not always fully simplified. In this, `+` means OR. `*` is the same. So `a+b`'s complement is `(a+b)(a+b)(a+b)*+()` (an `a` or `b`, followed by another, and then zero or more of them, OR the empty string.) – mbomb007 Mar 22 '17 at 21:19
Did you create the code on "implementation in Python" link above? If so - that's an amazing piece of code that must have taken you long to do. I only just saw this comment. Oddly though, the human solution I accepted, though not as robust (for any scenario) was a cleaner answer that passed all tests I had devised to throw at it. But your code theoretically could solve anything, even something a human could not so that is really impressive! – TMWP Mar 22 '17 at 22:20
@TMWP I told you, it's not a Python regex that it outputs. You need to read what I said more thoroughly. The [formal definition](https://en.wikipedia.org/wiki/Regular_expression#Formal_definition) of a regular expression DOES NOT have character classes, or negation, or non-greedy modifiers, or capture groups. It only has `+` (which acts like Python's `|`), `*` (same as Python's `*`), and concatenation. – mbomb007 Mar 22 '17 at 22:40
So if you want the complement of Python's `a+`, you have to use `aa*`. If you want Python's `a|b`, you must use `a+b`. Etc. Then after receiving the result, you'd have to convert it *back* into Python's syntax. The complement of `aa*` is `aaa*+()`, which means "2 or more `a`'s or empty string", which converts back to `aa+|` in Python, which is the complement if the only letter in the alphabet is `a`. The code does not have the ability to use a larger alphabet than the regex uses yet. – mbomb007 Mar 22 '17 at 22:45
@mbomb007 Why do you use `+`? I've never seen that for this. Not even the formal definition you're referencing uses `+` but uses `|`. And the complement of `aa*` (with alphabet {`a`}) isn't `aaa*+()` but only `ε`. – Stefan Pochmann Mar 23 '17 at 14:03
I probably copied the complement wrong. It's actually the complement of `a`. Also, I didn't write this code, I just copied it. The link is at the top. – mbomb007 Mar 23 '17 at 14:08
@StefanPochmann Also, the formal definition as it was originally specified used `+`. Wikipedia uses `|` to avoid introducing confusion, since it's documenting regular expressions. – mbomb007 Feb 01 '18 at 16:07

python complementing a complex regular expression

2 Answers2