6

I need to find abbreviations in text written in many languages. My current regex is:

import regex as re
# Raw string so that \w and \. reach the regex engine unchanged
pattern = re.compile(r'(?:\w\.)+', re.UNICODE | re.MULTILINE | re.DOTALL | re.VERSION1)
pattern.findall("U.S.A. u.s.a.")  # returns ['U.S.A.', 'u.s.a.']

I don't want u.s.a. in the result; I need only the uppercase matches. [A-Z] won't work for any language other than English.

artyomboyko

1 Answer

13

To match uppercase letters in any script, you need to use a Unicode character property. The standard re module does not support character properties, but the third-party regex module does.

>>> import regex
>>> regex.findall(r'\p{Lu}', 'ÜìÑ')
['Ü', 'Ñ']
Ignacio Vazquez-Abrams
  • The `Lu` in `\p{Lu}` stands for the Unicode general category **L**etter, **u**ppercase. Wikipedia has [more information and examples on character properties](https://en.wikipedia.org/wiki/Unicode_character_property#General_Category). The [regex documentation](https://bitbucket.org/mrabarnett/mrab-regex/src/hg/README.rst#rst-header-unicode-codepoint-properties-including-scripts-and-blocks) indicates that some of the other examples are also supported. – vlz Jun 17 '20 at 13:07
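
Putting the answer together with the pattern from the question, here is a minimal sketch of how the abbreviation search might look in Python 3. It assumes the third-party `regex` module is installed and that an abbreviation is simply a run of "uppercase letter followed by a dot" pairs. The names `ABBREVIATION` and `find_abbreviations` are hypothetical, chosen for illustration only.

```python
import regex  # third-party module; the standard `re` lacks \p{...}

# Assumption: an abbreviation is one or more "uppercase letter + dot" pairs.
# \p{Lu} matches any character in the Unicode "Letter, uppercase" category.
ABBREVIATION = regex.compile(r'(?:\p{Lu}\.)+')


def find_abbreviations(text):
    """Return every run of uppercase-letter-plus-dot pairs found in text."""
    return ABBREVIATION.findall(text)


print(find_abbreviations("U.S.A. u.s.a. С.Ш.А. с.ш.а."))
# prints ['U.S.A.', 'С.Ш.А.']
```

Because the group is non-capturing, `findall` returns the whole match rather than only its last repetition. Mixed-case input is matched only in part with this pattern, so it may need tightening (for example with word boundaries) depending on the data.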