Python regex uppercase unicode word

Question

I need to find abbreviations text in many languages. Current regex is:

import regex as re
pattern = re.compile('(?:[\w]\.)+', re.UNICODE | re.MULTILINE | re.DOTALL | re.VERSION1)
pattern.findall("U.S.A. u.s.a.")

I don't need u.s.a in the result, i need only uppercase text. [A-Z] won't work in any language except english.

You may find some help at this previous question: http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters — Ben Graham, Sep 26 '12 at 01:57

score 13 · Accepted Answer · answered Sep 26 '12 at 02:05

13

You need to use a Unicode character property in order to match them. re does not support character properties, but regex does.

>>> regex.findall(ur'\p{Lu}', u'ÜìÑ')
[u'\xdc', u'\xd1']

answered Sep 26 '12 at 02:05

The `Lu` in `\p{Lu}` marks the character properties **L**etter **u**ppercase. Wikipedia has [more information and examples on character properties](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category). The [regex documentation](https://bitbucket.org/mrabarnett/mrab-regex/src/hg/README.rst#rst-header-unicode-codepoint-properties-including-scripts-and-blocks) indicates that some of the other exampls are also supported. – vlz Jun 17 '20 at 13:07

1 Answers1