Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use `\p{Ll}` to match an arbitrary lower-case letter, or `\p{Zs}` for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
Actually, Perl supports **all** Unicode properties, not just the general categories. Examples include `\p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}`, and around 1,000 more. Only Perl’s and ICU’s regexes bother to cover the full complement of Unicode properties. Everybody else covers a tiny few, usually not even enough for minimal Unicode work. – tchrist Apr 25 '11 at 23:03
6 Answers
The `regex` module (an alternative to the standard `re` module) supports Unicode codepoint properties with the `\p{}` syntax.
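A minimal sketch of what this looks like in practice, assuming the third-party `regex` package from PyPI is installed. The helper name `find_by_property` is made up for illustration, and the guarded import is only there so the snippet still loads where `regex` is missing.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:  # keep the sketch importable where regex is absent
    regex = None


def find_by_property(prop, text):
    """Return every character in text that matches \\p{prop}."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is not installed")
    return regex.findall(r"\p{%s}" % prop, text)


if regex is not None:
    print(find_by_property("Ll", "aBcÉé"))          # lower-case letters
    print(find_by_property("Zs", "a b\tc\u00a0d"))  # space separators, not the tab
```

The same patterns raise an error with the standard `re` module, which does not know the `\p` escape.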

Not sure how complete the `\p{}` support is, but this module is actively developed and should eventually replace the built-in `re` module: see http://pypi.python.org/pypi/regex – RichVel Sep 18 '13 at 13:05
+1: `regex` is a drop-in replacement for stdlib's `re` module. If you know how to use `re`, you can use `regex` immediately: `import regex as re` and you have `\p{}` syntax support. Here's an [example of how to remove all punctuation in a string using `\p{P}`](http://stackoverflow.com/a/11066687) – jfs Dec 21 '13 at 02:11
@RichVel When you say that regex "should" eventually replace the built-in `re` module, did you mean that there are plans for it to do so? I would like to have access to Unicode `\p{}` properties without the dependency. (It causes problems with pypy and pyodide.) – yig Jan 03 '23 at 18:15
@yig that was just my personal opinion, not sure if anything's changed since 2013 – RichVel Jan 04 '23 at 11:57
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say `\p{Armenian}` to match Armenian characters. `\p{Ll}` or `\p{Zs}` work too.
Last commit to the Ponyguruma module was apparently in 2010 (http://dev.pocoo.org/hg/sandbox/ponyguruma), whereas the Python regex module on PyPI is actively developed: http://pypi.python.org/pypi/regex – RichVel Sep 18 '13 at 13:04
You can painstakingly use `unicodedata` on each character:

    import unicodedata

    def strip_accents(x):
        # Decompose to NFD, then drop the nonspacing marks (category Mn).
        return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')

Thanks. Although outside regexes, this might be a viable alternative for certain cases. – ThomasH Nov 13 '10 at 21:09
It seems that the Python `unicodedata` module doesn't presently contain information about e.g. the script or Unicode block of a character. See also https://stackoverflow.com/questions/48058402/unicode-table-information-about-a-character-in-python/48060112#48060112 – tripleee Jan 10 '18 at 06:44
Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as `\p{...}` into a range of values, extracted from the Unicode specification (v.5.0.0). Only categories are supported (e.g. `L`, `Zs`), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
Nice one, although you're using hand-crafted literals for the ranges in the code. It would be nice to have those literals be generated from some textual form of the spec. Or from unicodedata (http://docs.python.org/library/unicodedata.html#module-unicodedata). You could probably run through all valid Unicode code points and run them through unicodedata.category(), and use the output to populate the map ... – ThomasH Mar 06 '12 at 17:43
Thanks for the tip, I may implement that someday. The code above was created for JavaScript first (for which there were few sensible alternatives at the time), then ported to Python. I ran some regexes on the specs and finished with a throwaway script, but I agree a repeatable procedure would have been better, so I can keep it up-to-date. – mgibsonbr Mar 07 '12 at 00:52
I've hacked up a quick function that builds up the map dynamically (with just lists of chars as values):

    def unicats(maxu):
        m = defaultdict(list)
        for i in range(maxu):
            try:
                cat = unicodedata.category(unichr(i))
            except ValueError:  # unichr() rejects i > 0xFFFF on narrow builds
                cat = None
            if cat:
                m[cat].append(i)
        return m

    m = unicats(0x10FFFF)

Be aware that some categories get really big (e.g. len(m['Cn']) == 873882). – ThomasH Mar 07 '12 at 09:09
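The idea from these comments can be sketched for Python 3 as follows. This is only an illustration under stated assumptions: it relies on the Unicode data bundled with the running interpreter (so results vary by Python version), the names `category_ranges` and `char_class` are made up here, and it stores ranges rather than single code points to keep big categories such as `Cn` compact.

```python
import re
import sys
import unicodedata
from collections import defaultdict


def category_ranges(max_cp=sys.maxunicode):
    """Map each general category to inclusive (start, end) code point ranges."""
    ranges = defaultdict(list)
    prev_cat = None
    for cp in range(max_cp + 1):
        cat = unicodedata.category(chr(cp))
        if cat == prev_cat:
            # Same category as the previous code point: extend the open range.
            ranges[cat][-1] = (ranges[cat][-1][0], cp)
        else:
            ranges[cat].append((cp, cp))
        prev_cat = cat
    return ranges


def char_class(ranges, category, negate=False):
    """Build an re character class for a category ('Ll') or major class ('L')."""
    spans = sorted(
        span for cat, cat_spans in ranges.items()
        if cat.startswith(category) for span in cat_spans
    )
    if not spans:
        raise ValueError("no such category: %r" % category)
    # re understands \UXXXXXXXX escapes, so no raw characters are needed.
    body = "".join(
        "\\U%08x" % s if s == e else "\\U%08x-\\U%08x" % (s, e)
        for s, e in spans
    )
    return "[%s%s]" % ("^" if negate else "", body)


RANGES = category_ranges()
lower = re.compile(char_class(RANGES, "Ll"))
print(lower.findall("aBcÉé1"))
```

Building the map walks every code point once, so it is worth doing a single time at import and reusing the result.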
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (`\p{M}` or whatever) and replaces them with the corresponding character sets, so that, for example, `\p{M}` would become `[\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]`, and `\P{M}` would become `[^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]`.
People would thank you. :)
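A rough sketch of such a preprocessor for Python 3, not a finished tool. Assumptions: only general categories are handled (no scripts or blocks), the ranges come from the interpreter's own `unicodedata` tables rather than hand-written literals, and the names `expand_properties` and `_class_body` are hypothetical. Tokens that already sit inside a `[...]` class, or that follow an escaped backslash, are not handled.

```python
import re
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def _class_body(category):
    """Return the inside of a character class covering one category prefix."""
    spans, start, prev = [], None, None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(category):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            spans.append((start, prev))
            start = None
    if start is not None:
        spans.append((start, prev))
    if not spans:
        raise ValueError("unknown category: %r" % category)
    return "".join(
        "\\U%08x" % s if s == e else "\\U%08x-\\U%08x" % (s, e)
        for s, e in spans
    )


_TOKEN = re.compile(r"\\([pP])\{(\w+)\}")


def expand_properties(pattern):
    """Replace \\p{Xx} and \\P{Xx} tokens with explicit character classes."""
    def repl(match):
        negate = "^" if match.group(1) == "P" else ""
        return "[%s%s]" % (negate, _class_body(match.group(2)))
    return _TOKEN.sub(repl, pattern)


identifier = re.compile(expand_properties(r"^\p{Lu}(\p{L}|\p{N}|_)*"))
print(identifier.match("Éa_1+2").group(0))
```

Each category is scanned once and cached, so the first expansion of a given category is slow and later ones are cheap.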

Right, creating character classes crossed my mind. But with roughly 40 categories you end up producing 80 classes, and that's not counting Unicode scripts, blocks, planes and whatnot. Might be worth a little open source project, but still a maintenance nightmare. I just discovered that re.VERBOSE doesn't apply to character classes, so no comments here or white space to help readability... – ThomasH Dec 02 '09 at 15:17
Note that while `\p{Ll}` has no equivalent in Python regular expressions, `\p{Zs}` should be covered by `'(?u)\s'`.
The `(?u)`, as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and `\s` means any spacing character.

You're right. Problem is, '(?u)\s' is larger than '\p{Zs}', including e.g. newline. So if you really want to match only space separators, the former is overgenerating. – ThomasH Dec 07 '09 at 21:32
@ThomasH: to get "space except not newline" you can use the double-negated character class: `(?u)[^\S\n]` – bukzor Mar 22 '12 at 22:12
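To make the difference between these three options concrete, here is a small Python 3 comparison. In Python 3, `str` patterns are Unicode-aware by default, so the `(?u)` flag is left out. The `ZS` pattern is an illustrative construction built from the interpreter's `unicodedata` tables, not something the `re` module provides.

```python
import re
import sys
import unicodedata

# Every character the running Python classifies as Zs (space separator).
ZS_CHARS = [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]
ZS = re.compile("[%s]" % "".join("\\U%08x" % ord(c) for c in ZS_CHARS))

sample = "a b\tc\nd\u00a0e\u3000f"

print(re.findall(r"\s", sample))       # all Unicode whitespace, newline included
print(re.findall(r"[^\S\n]", sample))  # whitespace minus newline; tab still matches
print(ZS.findall(sample))              # space separators only
```

The double-negated class drops the newline but still accepts control characters such as tab, so it is broader than `\p{Zs}`.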