Inverse regex match on group in Python

Question

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.

Given a list of words, I want to print all the words that do not have special characters.

I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.

In my case given a string like: "should print nŌt thìs"

I want it to print only should and print, not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) returns the words that contain special characters - I just want to invert that.

jdaz · Accepted Answer · 2020-07-13T14:56:13.167

For this particular case, you can simply specify the regular alphabet range in your search:

import re
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']

Of course you can add digits or anything else you want to match as well.
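As a small illustration of widening the character class, here is a sketch that also accepts digits. The sample sentence and the variable names are made up for the example and do not come from the question.

```python
import re

# Hypothetical sample text; 0-9 is added to the allowed character class.
sample = "room 101 is nŌt here"

# \b on both sides means a word is only returned if every character in it
# belongs to the class, so "nŌt" is skipped rather than partially matched.
found = re.findall(r"\b[A-Za-z0-9]+\b", sample)
print(found)  # ['room', '101', 'is', 'here']
```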

As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:

r"\b(?!\w*[À-ǚ])\w*"

This:

1. Checks for a word boundary \b, such as the position right after a space or at the start of the input string.
2. Does the negative lookahead and abandons the match at that position if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check whether the special character is the first letter in the word.
3. Finally, if it makes it past the lookahead, it matches any word characters.
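The steps above can be sketched in Python as follows. This is a minimal example run against the question's sample string. It departs from the pattern as posted in one respect, which is an assumption on my part about what is wanted: it ends in \w+ instead of \w*, because \w* can match zero characters and re.findall would then also return empty strings at word boundaries. The range is written with \u escapes, which is equivalent to À-ǚ.

```python
import re

text = "should print nŌt thìs"

# \b            : start at a word boundary
# (?!\w*[...])  : give up if a character in U+00C0..U+01DA occurs later in the word
# \w+           : otherwise consume the word (+ rather than *, so no empty matches)
plain_word = re.compile(r"\b(?!\w*[\u00C0-\u01DA])\w+")

plain_words = plain_word.findall(text)
print(plain_words)  # ['should', 'print']
```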

Demo. Note that on regex101.com you must select the Python flavor for \b to work properly with special characters.

There is a third option as well:

r"\b[^À-ǚ\s]*\b"

The middle part [^À-ǚ\s]* means match any character other than the special characters or whitespace, zero or more times.
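A quick sketch of this third pattern on the question's sample string. Because the starred class can match nothing at all, re.findall also reports empty strings at word boundaries, so the example filters those out. The filtering step is my addition, not part of the answer.

```python
import re

text = "should print nŌt thìs"

# The negated class excludes U+00C0..U+01DA and whitespace; the trailing \b
# prevents partial matches such as the "th" in "thìs".
matches = re.findall(r"\b[^\u00C0-\u01DA\s]*\b", text)

# Zero-length matches show up as '' in the result; keep only real words.
plain_words = [m for m in matches if m]
print(plain_words)  # ['should', 'print']
```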

I don't think that first regex works because it matches things like `zhì` - it just matches the zh. — Grant Curell, Jul 13 '20 at 13:32
Another issue - wouldn't `r"\b(?!\w*[À-ǚ])\w*"` only work on Latin-based special characters? — Grant Curell, Jul 13 '20 at 13:42
Yes the first one should also be surrounded by `\b`. Fixed. And yes, by “special characters” I just mean the range you posted, `[\u00C0-\u01DA]` — jdaz, Jul 13 '20 at 14:59

score 0 · Answer 2 · answered Oct 14 '20 at 22:32

This is not a regex, just a different approach you may not have considered. I suppose it would also be much slower, but I think it works:

>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...  if any('WITH' in ud.name(letter) for letter in word)]
['Cá', 'Lá']

To reverse the filter, use not any(...) or, equivalently, all('WITH' not in ud.name(letter) for letter in word). Simply swapping in 'WITH' not in inside any() would keep every word that has at least one unaccented letter.
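For the question's actual goal (keeping the words without special characters), the reversed filter might look like the sketch below. The helper name has_diacritic is hypothetical, and the approach is a heuristic: it relies on precomposed accented letters having Unicode names that contain "WITH", so it would not catch the apostrophe that the question's character class also covers.

```python
import unicodedata as ud

def has_diacritic(word):
    # Names look like "LATIN SMALL LETTER I WITH GRAVE"; the "" default
    # avoids a ValueError for characters that have no Unicode name.
    return any("WITH" in ud.name(ch, "") for ch in word)

words = "should print nŌt thìs".split()

# Keep only the words in which no letter carries a diacritic.
plain_words = [w for w in words if not has_diacritic(w)]
print(plain_words)  # ['should', 'print']
```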
