I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.

Given a list of words, I want to print all the words that do not have special characters.

I have a regex that identifies words with special characters: \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers covering fairly straightforward scenarios, like negating a simple word. However, I haven't been able to find anything that negates a group. I've seen several different syntaxes that include the negative lookahead ?!, but I haven't been able to come up with one that works.

In my case given a string like: "should print nŌt thìs"

it should print should and print, but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) returns the words that contain special characters. I just want to invert that.

Grant Curell

2 Answers

For this particular case, you can simply specify the regular alphabet range in your search:

import re

a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']

Of course you can add digits or anything else you want to match as well.
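
As a minimal sketch of that extension (the sample sentence and the name `pattern` are made up for illustration, not taken from the question), adding `0-9` to the class also keeps purely numeric tokens:

```python
import re

# Hypothetical sample text, not from the question.
text = "room 101 is nŌt open"

# ASCII letters and digits only, with a word boundary on both sides.
pattern = re.compile(r"\b[A-Za-z0-9]+\b")

matches = pattern.findall(text)
print(matches)  # ['room', '101', 'is', 'open']
```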

As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:

r"\b(?!\w*[À-ǚ])\w*"

This:

  • Checks for a word boundary \b, that is, a position between a word character and a non-word character, such as just after a space or at the start of the input string.
  • Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
  • Finally, if it makes it past the lookahead, it matches any word characters.

Demo. Note that on regex101.com you must select the Python flavor for \b to work properly with special characters.
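
To make the three steps above concrete, here is a small sketch run on the string from the question. One labeled assumption departs from the pattern as written: the final `\w*` is swapped for `\w+`, because `\w*` can also succeed with a zero-length match at the boundary after each word, which would add empty strings to the `findall` result. The name `clean_word` is hypothetical.

```python
import re

text = "should print nŌt thìs"

# \b                        start at a word boundary
# (?!\w*[\u00C0-\u01DA])    give up if the word ahead contains a character in that range
# \w+                       otherwise consume the word (\w+ instead of \w*: an assumption
#                           made here to avoid zero-length matches)
clean_word = re.compile(r"\b(?!\w*[\u00C0-\u01DA])\w+")

clean_words = clean_word.findall(text)
print(clean_words)  # ['should', 'print']
```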

There is a third option as well:

r"\b[^À-ǚ\s]*\b"

The middle part [^À-ǚ\s]* matches any character other than the special characters or whitespace, zero or more times. Unlike \w, this also lets punctuation through.
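
A quick sketch of this third pattern on the same string. As a labeled assumption, `*` is replaced by `+` here so that zero-length matches at word boundaries do not show up as empty strings in the result. The name `no_special` is hypothetical.

```python
import re

text = "should print nŌt thìs"

# Runs of characters that are neither whitespace nor in U+00C0..U+01DA,
# with a word boundary on both sides. "+" instead of "*" is an assumption
# made here to skip empty matches.
no_special = re.compile(r"\b[^\u00C0-\u01DA\s]+\b")

result = no_special.findall(text)
print(result)  # ['should', 'print']
```

Words such as "nŌt" are rejected as a whole because there is no word boundary between "n" and "Ō", so no partial match can end there.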

jdaz
  • I don't think that first regex works because it matches things like `zhì` - it just matches the zh. – Grant Curell Jul 13 '20 at 13:32
  • Another issue - wouldn't `r"\b(?!\w*[À-ǚ])\w*"` only work on Latin-based special characters? – Grant Curell Jul 13 '20 at 13:42
  • Yes, the first one should also be surrounded by `\b`. Fixed. And yes, by “special characters” I just mean the range you posted, `[\u00C0-\u01DA]` – jdaz Jul 13 '20 at 14:59

I know this is not a regex, but here is a completely different idea you may not have considered. I suppose it would also be much slower, but I think it works:

>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...  if any('WITH' in ud.name(letter) for letter in word)]
['Cá', 'Lá']

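To reverse it and keep only the words without such letters, use all('WITH' not in ud.name(letter) for letter in word) as the condition. Swapping in for not in while keeping any(...) would let 'Cá' through, because 'C' alone satisfies it.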
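
A minimal sketch of the reversed filter, applied to the string from the question. The helper name `has_accent` is hypothetical, and `ud.name(ch, '')` passes a default so that characters without a Unicode name do not raise `ValueError`.

```python
import unicodedata as ud

def has_accent(word):
    """Return True if any character's Unicode name contains 'WITH'."""
    # ud.name(ch, '') falls back to '' for unnamed characters instead of raising.
    return any('WITH' in ud.name(ch, '') for ch in word)

words = "should print nŌt thìs".split()

# Keep only the words in which no character name contains 'WITH'.
plain = [w for w in words if not has_accent(w)]
print(plain)  # ['should', 'print']
```

This relies on Unicode naming conventions (for example LATIN SMALL LETTER I WITH GRAVE), so letters whose names lack "WITH" are treated as plain.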

progmatico