There are two parts to this, which should work for all languages:*
- Your strings must be normalized to NFKD, so that two equivalent strings are guaranteed to have the same sequence of code points.
- To ignore case in comparing two NFKD strings, use the Unicode case-folding algorithm.
Between the two, this handles English upper and lower case, Arabic initial/medial/final (plus isolated) forms, German ß vs. ss, é as a single code point vs. e\N{COMBINING ACUTE ACCENT}, Chinese rotated characters, Japanese half-width kana, and probably all kinds of other things you haven't thought of.
In Python, that looks like this:
>>> import unicodedata
>>> s1 = 'ﻧ'
>>> s2 = 'ﻨ'
>>> unicodedata.normalize('NFKD', s1).casefold() == unicodedata.normalize('NFKD', s2).casefold()
True
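The same two-step comparison handles the other examples above as well. Here's a quick sketch that wraps it in a helper (the name caseless_equal is just for illustration, not anything built in):
>>> def caseless_equal(a, b):
...     # Normalize both strings to NFKD, then apply the default Unicode case folding.
...     return (unicodedata.normalize('NFKD', a).casefold()
...             == unicodedata.normalize('NFKD', b).casefold())
...
>>> caseless_equal('Straße', 'STRASSE')    # German ß vs. ss
True
>>> caseless_equal('caf\u00e9', 'cafe\N{COMBINING ACUTE ACCENT}')    # precomposed é vs. e plus combining accent
True
>>> caseless_equal('ｶﾀｶﾅ', 'カタカナ')    # half-width vs. full-width kana
True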
Note that casefold wasn't added until Python 3.3. If you're using an earlier version of Python, there are implementations on PyPI; using them should be similar to using the 3.3+ builtin.
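If you need a single code path that runs on both older and newer 3.x without a third-party package, a crude sketch (my own workaround, not one of the PyPI backports) is to fall back to str.lower, with the caveat that lower() is not a full case fold (it misses ß → ss, for example):
>>> fold = getattr(str, 'casefold', str.lower)    # str.lower is only a rough approximation of case folding
>>> fold(unicodedata.normalize('NFKD', s1)) == fold(unicodedata.normalize('NFKD', s2))
True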
If you're interested in exactly how this works for Arabic, rather than just the fact that it works for Arabic along with every other language, you'll have to read the algorithms and tables at unicode.org. IIRC, the W3C document that recommends doing this explains why it works using Arabic as an example. I believe it's because Unicode treats the initial, medial, final, and isolated forms as compatibility-equivalent presentation forms of the same base character, so the compatibility decomposition in NFKD maps all of them to that one base character, which case folding then compares equal, even though case-folding a presentation form directly would just return the presentation form unchanged.
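For instance, assuming s1 and s2 above are the noon initial and medial presentation forms, you can watch the compatibility decomposition collapse them to the same base character:
>>> [unicodedata.name(c) for c in s1 + s2]
['ARABIC LETTER NOON INITIAL FORM', 'ARABIC LETTER NOON MEDIAL FORM']
>>> [unicodedata.name(c) for c in unicodedata.normalize('NFKD', s1 + s2)]
['ARABIC LETTER NOON', 'ARABIC LETTER NOON']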
* There are a few cases where two different languages or cultures use the same script, but have different case-folding rules; in that case, you need locale-specific casefolding, which Python doesn't include. But that shouldn't be relevant here.
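The classic example is Turkish, which folds I to the dotless ı; Python's casefold only implements the default, locale-independent folding:
>>> 'I'.casefold()    # a Turkish-aware fold would give dotless 'ı' instead
'i'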