I'm trying to write a Python regex that validates a string composed only of:
- Any unicode alphanumeric character (including combining characters)
- Any number of space characters
- Any number of underscores
- Any number of dashes
- Any number of periods
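To make the rules above concrete, here is a character-by-character sketch of what I mean by "valid". It is not a regex, only a reference check, and it rests on two assumptions of mine: that "alphanumeric" means whatever `str.isalnum()` accepts, and that "combining character" means any code point whose Unicode general category starts with `M`. The name `is_valid_name` is made up for illustration.

```python
import unicodedata

ALLOWED_PUNCTUATION = set(" _-.")  # space, underscore, dash, period


def is_valid_name(text):
    """Return True if every character is allowed by the rules above.

    Hypothetical helper: alphanumerics via str.isalnum(), combining
    marks via the Unicode general category (Mn, Mc, Me).
    """
    if not text:
        return False  # assumption: an empty string is not valid
    return all(
        ch.isalnum()
        or ch in ALLOWED_PUNCTUATION
        or unicodedata.category(ch).startswith("M")
        for ch in text
    )


precomposed = '9 Melod\xeda.de_la-monta\xf1a'
combining = '9 Melodi\u0301a.de_la-montan\u0303a'
print(is_valid_name(precomposed), is_valid_name(combining))
```

Both test strings should pass this check, while something like `'bad!'` should not.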
My test strings:
9 Melodía.de_la-montaña
9 Melodía.de_la-montaña
or, as string literals produced with ascii():
str1 = '9 Melod\xeda.de_la-monta\xf1a'
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'
These look identical but aren't: the first is normalized (precomposed characters), while the second uses combining characters for the diacritics.
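A quick way to confirm the difference, assuming the standard-library `unicodedata` module and taking NFC as the normalization form of the first string:

```python
import unicodedata

str1 = '9 Melod\xeda.de_la-monta\xf1a'          # precomposed characters
str2 = '9 Melodi\u0301a.de_la-montan\u0303a'    # base letters + combining marks

same_as_is = (str1 == str2)
lengths = (len(str1), len(str2))

# Composing the second string (NFC) should give the first one.
same_after_nfc = (unicodedata.normalize('NFC', str2) == str1)

# General category of the two combining marks used above.
mark_categories = [unicodedata.category(c) for c in ('\u0301', '\u0303')]

print(same_as_is)        # False
print(lengths)           # (23, 25)
print(same_after_nfc)    # True
print(mark_categories)   # ['Mn', 'Mn']
```

So the strings only compare equal after normalization, and the two extra code points in the second one are nonspacing marks (category `Mn`).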
Here's my first stab:
import re
reg = re.compile(r"^[\w\.\- ]+$", re.IGNORECASE)
re.search(reg, str1) # None
re.search(reg, str2) # None
If I remove the anchors (^ and $) and use findall instead of search, I get lists like ['9 Melodi', 'a.de_la-montan', 'a'] or ['9 Melod', 'a.de_la-monta', 'a'].
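The split points in the combining-character case can be reproduced in isolation. This is a minimal check on Python 3 with the standard `re` module, using only the second test string:

```python
import re

str2 = '9 Melodi\u0301a.de_la-montan\u0303a'

# A bare combining acute accent (U+0301) on its own is not matched by \w.
mark_match = re.fullmatch(r'\w', '\u0301')
print(mark_match)   # None

# So the character class stops at every combining mark.
pieces = re.findall(r'[\w.\- ]+', str2)
print(pieces)       # ['9 Melodi', 'a.de_la-montan', 'a']
```

That suggests the combining marks themselves are what `\w` rejects, at least for this string.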
I've even tried re.compile(r"^[\w\.\- ]+$", re.IGNORECASE | re.UNICODE), although that should be unnecessary in Python 3, right?
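As a sanity check on that assumption, the compiled pattern's `flags` attribute can be inspected. This sketch assumes Python 3 and a `str` (not `bytes`) pattern:

```python
import re

plain = re.compile(r"^[\w\.\- ]+$")
explicit = re.compile(r"^[\w\.\- ]+$", re.UNICODE)

# For str patterns, re.UNICODE is already set when no flag is passed.
unicode_on_by_default = bool(plain.flags & re.UNICODE)
same_flags = (plain.flags == explicit.flags)

print(unicode_on_by_default)  # True
print(same_flags)             # True
```

If both print `True`, passing `re.UNICODE` explicitly changes nothing for this pattern.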
While searching for an answer I found several related questions, but they are all old, deal with Python 2, and seem to suggest that the regex I wrote should work. The Python 3.5 regex docs mention that \w should match Unicode but offer no actual examples involving non-ASCII text.
How do I match the desired strings?