The python library methods upper() and lower() always convert case according to the Unicode "default" rules. These rules are language-agnostic, in other words, possibly wrong when applied to a language where the case conversion differs from the default rule.
In Java, the methods for changing the case for Strings are locale aware, but I was unable to find such functions in either the Python standard library or any additional package.
What is the best way to do this in Python 3, short of having to manually implement it oneself (which is probably pretty much impossible if the aim is to do it for more than one language).
Examples of language specific case conversion:
- Turkish lower case dotted/undotted i maps to the corresponding upper case version: ı/I i/İ
- German lower case scharfes s maps to upper case SS for German: ß/SS
Ideally this should, like in Java, do something like assert "İ" == "i".to_upper("tr", "TR")
UPDATE based on comments (thanks for those!):
- I am not looking for casefold which could be seen as avoiding, rather than solving the problem. Casefold still is language agnostic.
- I have seen the python locale package but I do not see the connection to my problem: the lower()/upper() methods are documented to use the language agnosting default scheme so I am not expecting that changing the locale will change their behavior. Also, at least on my system, changing the locale to "tr" "TR" does not even work. The language specific treatment of case conversion should not require that the host system actually has all locales installed.
- I am mainly interested in the many ways how those language-specific mappings are officially or very commonly done in different countries/languages, not ALL possible mappings. I can live with rare or very specific mappings having to be implemented by myself, just wondering about a standard set to handle correctly (again, much like Java does).
UPDATE2:
- It seems there are actually not that many special cases which would be officially described in the Unicode documentation. This document only has a few exceptions for just Turkish and Lithuanian documented: https://www.unicode.org/Public/UNIDATA/SpecialCasing.txt All other mappings (e.g. Greekscript) which are not dependent on language context are already implemented in Python since they are part of the Unicode default case conversion specification.