
In Latin script, letters have an upper-case and a lower-case form. In Python, if you want to compare two strings without regard to their case, you can convert them to the same case using 'string'.upper() or 'string'.lower().

In Arabic script, letters can have an initial, medial, or final form. Is there a similar way to compare strings of Arabic characters without caring which form the letters are in?

drs
  • Not directly. Are you actually trying to convert all of the characters to medial form, or are you doing that either (a) to do the Arabic equivalent of English case-insensitive comparison (or sorting, etc.), or (b) to generate the Arabic equivalent of English sentence case or title case? Because there _are_ ways to do those directly. – abarnert May 05 '15 at 01:11
  • @abarnert, I'm looking to do the former: the Arabic equivalent of English case-insensitive comparison. – drs May 05 '15 at 01:12
  • If you really _do_ need to convert the characters to medial form, you need to manually apply the information from the Unicode database. Python has a big chunk of the database in its `unicodedata` module; if you need more, you can download and parse the files from `unicode.org` or look for third-party modules on PyPI. (I'd have to check whether it has enough for this purpose…) – abarnert May 05 '15 at 01:12

1 Answer


There are two parts to this, which should work for all languages:*

  • Your strings must be normalized to NFKD to guarantee that two equivalent strings have identical code points.
  • To ignore case when comparing two NFKD strings, use the Unicode case-folding algorithm.

Between the two, this handles English upper and lower case, Arabic initial/medial/final (plus isolated), German ß vs. ss, é as a single code point vs. e\N{COMBINING ACUTE ACCENT}, Chinese rotated characters, Japanese half-width kana, and probably all kinds of other things you haven't thought of.

In Python, that looks like this:

>>> import unicodedata
>>> s1 = 'ﻧ'
>>> s2 = 'ﻨ'
>>> # Normalize and casefold both sides before comparing
>>> unicodedata.normalize('NFKD', s1).casefold() == unicodedata.normalize('NFKD', s2).casefold()
True

Note that casefold wasn't added until Python 3.3. If you're using an earlier version of Python, there are implementations on PyPI; using them should be similar to using the 3.3+ builtin.
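
To make the two steps reusable, they can be wrapped in a small helper. This is only a sketch, assuming Python 3.3+; the names `caseless_key` and `caseless_equal` are made up for illustration and are not part of any library.

```python
import unicodedata


def caseless_key(s):
    """Return a comparison key: NFKD-normalize, then casefold."""
    # NFKD maps compatibility forms (Arabic presentation forms,
    # half-width kana, precomposed accents) to their base characters.
    normalized = unicodedata.normalize("NFKD", s)
    # casefold() then removes case distinctions (e.g. ß -> ss).
    return normalized.casefold()


def caseless_equal(a, b):
    """Compare two strings, ignoring case and presentation form."""
    return caseless_key(a) == caseless_key(b)


# Arabic noon: initial form (U+FEE7) vs. medial form (U+FEE8)
print(caseless_equal("\uFEE7", "\uFEE8"))
# German sharp s vs. "SS"
print(caseless_equal("Stra\u00DFe", "STRASSE"))
# Precomposed é vs. e + COMBINING ACUTE ACCENT
print(caseless_equal("\u00E9", "e\u0301"))
# Half-width katakana KA (U+FF76) vs. full-width KA (U+30AB)
print(caseless_equal("\uFF76", "\u30AB"))
```

All four comparisons should print True. The same `caseless_key` function can also be passed as `key=` to `sorted()` if you want grouping rather than equality.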


If you're interested in exactly how this works for Arabic, rather than just the fact that it works for Arabic along with every other language, you have to read the algorithms and tables at unicode.org. IIRC, the W3C document that recommends doing this explains why it works using Arabic as an example. I believe it's because Unicode treats the initial, medial, final, and isolated presentation forms as compatibility-equivalent to the same base letter, so NFKD normalization maps each of them to that one base code point. Casefolding alone would not help here: applied directly to a presentation-form character, it just returns the character itself.
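
You can see this for yourself with the `unicodedata` module. The following is a rough sketch, using the four presentation forms of the letter noon (U+FEE5 through U+FEE8) as the example; `NOON_FORMS` and `describe` are hypothetical names chosen for illustration.

```python
import unicodedata

# Isolated, final, initial, and medial presentation forms of noon
NOON_FORMS = ["\uFEE5", "\uFEE6", "\uFEE7", "\uFEE8"]


def describe(ch):
    """Return (name, decomposition tag, NFKD result, casefold result)."""
    return (
        unicodedata.name(ch),
        # Compatibility mapping from the Unicode database,
        # e.g. '<initial> 0646'
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFKD", ch),
        ch.casefold(),
    )


for ch in NOON_FORMS:
    name, decomp, nfkd, folded = describe(ch)
    print("U+%04X %s" % (ord(ch), name))
    print("  decomposition:", decomp)
    print("  NFKD -> U+%04X" % ord(nfkd))
    print("  casefold unchanged:", folded == ch)
```

Each form carries a compatibility decomposition to U+0646 (ARABIC LETTER NOON), which is what NFKD applies; casefold on its own leaves every form as it was.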


* There are a few cases where two different languages or cultures use the same script, but have different case-folding rules; in that case, you need locale-specific casefolding, which Python doesn't include. But that shouldn't be relevant here.
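
One well-known instance of this is Turkish, where the lower-case partner of I is dotless ı (U+0131) and the upper-case partner of i is dotted İ (U+0130). The sketch below shows the default folding missing that pairing, plus a hand-rolled workaround; `turkic_casefold` is a hypothetical helper written for this example, not something Python or `unicodedata` provides.

```python
def turkic_casefold(s):
    """Casefold using the Turkic mappings for I and İ (illustrative only)."""
    # Apply the two Turkic-specific foldings first:
    #   U+0130 (İ) -> i,  I -> U+0131 (ı)
    # then let the default casefold handle everything else.
    return s.replace("\u0130", "i").replace("I", "\u0131").casefold()


word_upper = "ISIK"
word_lower = "\u0131s\u0131k"  # the same letters in Turkish lower case

# Default folding maps I to i, so the two spellings do not match
print(word_upper.casefold() == word_lower.casefold())
# With the Turkic mappings applied, they do
print(turkic_casefold(word_upper) == turkic_casefold(word_lower))
```

The first comparison prints False and the second True. You would only want something like this when you know the text is in a language that uses these rules, since it gives wrong results for English.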

abarnert
  • Thanks a lot. The best solution ever. You saved my life. – Arman May 14 '21 at 13:04
  • "Unicode treats initial, medial, final, and isolated as compatibility-equivalent presentation forms of the same character": Sort of. The characters encoded as code points in the Unicode Arabic block (0600--06FF) represent all forms of each character: initial, medial, final and isolated. It is up to the font and the display or printing mechanism to choose the right form, based on its environment. In addition, there are Unicode blocks for "Presentation Forms" which encode individual forms (where these exist), up in FB50--FEFF. I haven't seen real texts using these forms in years. – Mike Maxwell Aug 10 '22 at 02:59