Python: how to do proper language-aware case conversion of strings?

Question

The Python string methods upper() and lower() always convert case according to the Unicode "default" rules. These rules are language-agnostic, so they give wrong results for any language whose case conversion differs from the default.
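To make the problem concrete, here is a small sketch of what the built-in methods do today. The variable names are only for illustration; the Turkish expectations in the comments are what a locale-aware conversion would return instead.

```python
# Default (language-agnostic) Unicode case mappings in Python 3.
default_upper_i = "i".upper()       # "I"; Turkish would expect "\u0130" (dotted capital I)
default_lower_I = "I".lower()       # "i"; Turkish would expect "\u0131" (dotless small i)
default_upper_eszett = "ß".upper()  # "SS"; this one is part of the default rules

print(default_upper_i, default_lower_I, default_upper_eszett)
```

Neither result for the letter i changes with the process locale, which is the behavior I am trying to work around.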

In Java, the case-conversion methods of String are locale-aware, but I was unable to find such functions in either the Python standard library or any additional package.

What is the best way to do this in Python 3, short of implementing it manually (which is probably close to impossible if the aim is to support more than one language)?

Examples of language-specific case conversion:

Turkish: lower-case dotless ı and dotted i map to the corresponding upper-case letters: ı/I, i/İ
German: lower-case scharfes S (ß) maps to upper-case SS: ß/SS

Ideally this should, like in Java, do something like assert "İ" == "i".to_upper("tr", "TR")

UPDATE based on comments (thanks for those!):

I am not looking for casefold, which could be seen as avoiding rather than solving the problem. casefold is still language-agnostic.
I have seen the Python locale module, but I do not see how it relates to my problem: the lower()/upper() methods are documented to use the language-agnostic default scheme, so I do not expect changing the locale to change their behavior. Also, at least on my system, setting the locale to "tr", "TR" does not even work. Language-specific case conversion should not require that the host system actually has all locales installed.
I am mainly interested in the language-specific mappings that are officially or very commonly used in different countries/languages, not ALL possible mappings. I can live with having to implement rare or very specific mappings myself; I am just wondering whether there is a standard set that can be handled correctly (again, much like Java does).

UPDATE2:

It seems there are actually not that many special cases officially described in the Unicode documentation. This file documents only a few language-specific exceptions, for Turkish (shared with Azerbaijani) and Lithuanian: https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt All other mappings (e.g. for Greek script) that do not depend on the language are already implemented in Python, since they are part of the Unicode default case conversion specification.
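Given how short that list is, the Turkish part could probably be handled with a small wrapper around the built-in methods. The following is only a minimal sketch under that assumption: `upper_tr` and `lower_tr` are hypothetical helper names (nothing like them exists in the standard library), and the Lithuanian rules are not covered.

```python
# Hypothetical helpers, not a standard API. They apply the Turkish/Azerbaijani
# dotted/dotless i rules first and leave everything else to the default rules.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # İ -> i, I -> ı


def upper_tr(text: str) -> str:
    """Upper-case text with the Turkish i mappings, then the default rules."""
    return text.translate(_TR_UPPER).upper()


def lower_tr(text: str) -> str:
    """Lower-case text with the Turkish i mappings, then the default rules."""
    # "I" followed by COMBINING DOT ABOVE is a decomposed İ: map it to plain "i".
    text = text.replace("I\u0307", "i")
    return text.translate(_TR_LOWER).lower()


print(upper_tr("istanbul"), lower_tr("DI\u015e"))
```

Translating before calling upper()/lower() matters here: once the default mapping has run, a dotted and a dotless i can no longer be told apart.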

[`str.casefold`](https://docs.python.org/3/library/stdtypes.html#str.casefold) — Hampus Larsson, Jun 04 '20 at 10:38
Have you seen https://docs.python.org/3/library/locale.html ? — khelwood, Jun 04 '20 at 10:40
@usr2564301 yes, that is correct, but that is not used as the official upper-case variant of the lower case scharfes S in any country, so I would accept that if I want this to work I have to implement it myself. I am more interested in "standard" mappings officially used. — jpp1, Jun 04 '20 at 12:17
@usr2564301 the PyICU package looks like a good solution, with the potential downside that it does not work on Python strings directly and needs to convert back and forth between its own representation and Python. Not sure what the cost of this is yet. — jpp1, Jun 04 '20 at 12:52
You also have to factor in the cost of doing it all by yourself `:)` — Jongware, Jun 04 '20 at 13:18

0 Answers