Regex [A-Z] Does Not Recognize Local Characters

Question

I've checked other questions and read their solutions, but they do not work for me. I've tested the regular expression, and it works on non-locale characters. The code simply finds any capital letter in a string and performs some procedure on it. For example, minikŞeker bir kedi should return kŞe, but my code does not recognize Ş as a letter within [A-Z]. When I try re.LOCALE instead of re.UNICODE, as some people suggest, I get the error ValueError: cannot use LOCALE flag with a str pattern.

import re
corp = "minikŞeker bir kedi"
pattern = re.compile(r"([\w]{1})()([A-Z]{1})", re.U)
corp = re.sub(pattern, r"\1 \3", corp)
print(corp)

This works for minikSeker bir kedi, but it doesn't work for minikŞeker bir kedi, and it throws an error with re.L. The error I'm getting is ValueError: cannot use LOCALE flag with a str pattern. Searching for it yielded some Git discussions but nothing useful.

Possible duplicate of [Regular expression to match non-English characters?](https://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters) — Venki WAR, May 12 '18 at 04:44
I've added the error to the main body of the text. I've read that question before; the \u escape did not work. Unlike the asker of that question, I'm not interested in locale characters as such. I want to be able to perform the tasks I normally perform by having them included in my charset of [A-Z]. — Salih F. Canpolat, May 12 '18 at 04:51
As far as I understand you want for the input "minikSeker bir kedi" (without special characters) the output "kSe", right? But it does not work. — xhancar, May 12 '18 at 04:59
I'd like to be able to use "minikŞeker kedi" and get output "minik Şeker kedi". Since Ş is not recognized by [A-Z] I can't unless I build my pattern like [A-ZŞÇĞİÜ], however doing this every time I encounter a UTF-8 problem would be absurd, there should be a way to point regex to UTF-8 charset. — Salih F. Canpolat, May 12 '18 at 05:07
You usually don't want to "point regex to UTF-8 charset", you just want to use a Unicode `str` (or `unicode`, for Python 2) for your pattern and your search string, and then you have access to all of Unicode. In fact, if you're using Python 3, you're already doing that. But that doesn't solve your problem on its own. — abarnert, May 12 '18 at 05:11

abarnert · Answer 1 · 2018-05-12T05:30:50.600

The problem is that Ş is not in the range [A-Z]. That range is the class of all characters whose code points lie between U+0041 and U+005A (inclusive). (If you were using bytes mode, it would be all bytes between 0x41 and 0x5A.) And Ş is U+015E (or, e.g., 0xAA in bytes, assuming latin2), which isn't in that range.
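
A quick way to check the code points involved, as a minimal sketch using only built-ins (the variable names are illustrative, not from the question):

```python
import re

# Print the code points of the range boundaries and of the Turkish letter.
for ch in "AZŞ":
    print(ch, f"U+{ord(ch):04X}")

# [A-Z] spans only the code points from 'A' to 'Z', so Ş falls outside it.
in_ascii_range = ord("A") <= ord("Ş") <= ord("Z")
matches_class = re.fullmatch(r"[A-Z]", "Ş") is not None
print(in_ascii_range, matches_class)
```

Both checks come out false for Ş, which is why the original pattern skips it.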

And using a locale won't change that. As re.LOCALE explains, all it does is:

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale.

Also, you almost never want to use re.LOCALE. As the docs say:

The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales.
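
The error from the question can be reproduced in isolation. This is a sketch that assumes Python 3.6 or later, where re.LOCALE is rejected for str patterns at compile time; with a bytes pattern the flag is accepted, but [A-Z] still means the bytes 0x41 through 0x5A:

```python
import re

# Assumption: Python 3.6+, where LOCALE plus a str pattern raises ValueError.
try:
    re.compile(r"[A-Z]", re.LOCALE)
    locale_error = None
except ValueError as exc:
    locale_error = str(exc)
print(locale_error)

# A bytes pattern compiles with LOCALE, but the range is unchanged,
# so the latin2 byte for Ş (0xAA) is still not matched.
bytes_pattern = re.compile(rb"[A-Z]", re.LOCALE)
s_cedilla_latin2 = "Ş".encode("iso-8859-2")
print(s_cedilla_latin2, bytes_pattern.search(s_cedilla_latin2))
```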

If you only care about a single script, you can build a class of the appropriate ranges for that script.
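
For Turkish, such a class might look like the sketch below. The name TR_UPPER and the simplified two-group pattern are illustrative assumptions, not part of the original code; the class is the ASCII capitals plus the extra uppercase letters of the Turkish alphabet:

```python
import re

# Hypothetical single-script class: A-Z plus Ç, Ğ, İ, Ö, Ş, Ü.
TR_UPPER = "[A-ZÇĞİÖŞÜ]"

# Insert a space between any word character and a following capital.
pattern = re.compile(r"(\w)(" + TR_UPPER + ")")
result = pattern.sub(r"\1 \2", "minikŞeker bir kedi")
print(result)
```

This stays within the standard library, at the cost of maintaining one class per script.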

If you want to work with all scripts, you need to build a class out of a Unicode character class like Lu for "all uppercase letters". Unfortunately, Python's re doesn't have a mechanism for doing this directly. You can build a giant class out of the information in unicodedata, but that's pretty annoying:

import unicodedata
# Build a character class from every code point in category Lu (uppercase letter).
Lu = '[' + ''.join(chr(c) for c in range(0x110000)
                   if unicodedata.category(chr(c)) == 'Lu') + ']'

And then:

pattern = re.compile(r"([\w]{1})()(" + Lu + r"{1})", re.U)

… or maybe:

pattern = re.compile(rf"([\w]{{1}})()({Lu}{{1}})", re.U)

Part of the reason re doesn't have any way to specify Unicode classes is that, for a long time, the plan was to replace re with a new module, so many suggested new features for re were rejected. But the good news is that the intended new module is available as a third-party library, regex. It works just fine, and is a near drop-in replacement for re; it was just improving too quickly to lock it down to the slower Python release schedule. If you install it, then you can write your code this way:

import regex
corp = "minikŞeker bir kedi"
pattern = regex.compile(r"([\w]{1})()(\p{Lu}{1})", regex.U)
corp = regex.sub(pattern, r"\1 \3", corp)
print(corp)

The only change I made was to replace re with regex, and then use \p{Lu} instead of [A-Z].

There are, of course, lots of other regex engines out there, and many of them also support Unicode character classes. Most of those that do follow some variation on the same \p syntax. (They all copied it from Perl, but the details differ—e.g., regex's idea of Unicode classes comes from the unicodedata module, while PCRE and PCRE2 attempt to be as close to Perl as possible, and so on.)

Thanks a lot! This solves all of my problems! I'll probably be using regex a lot in my future projects, so this solves a lot of future problems all at once. I'm no stranger to adding modules: I conda it, read the docs, and start right away! — Salih F. Canpolat, May 12 '18 at 05:27

score 0 · Answer 2 · answered May 12 '18 at 06:00

abarnert's answer is great, but if all you want to do is find uppercase characters, str.isupper() works without the need for an extra module.

>>> foo = "minikŞeker bir kedi"
>>> for i, c in enumerate(foo):
...     if c.isupper():
...         print(foo[i-1:i+2])
...         break
... 
kŞe

or perhaps

>>> foo = "minikŞeker bir kedi"
>>> ''.join((' ' if c.isupper() else '') + c for c in foo)
'minik Şeker bir kedi'
