Combined diacritics do not normalize with unicodedata.normalize (PYTHON)

Question

I understand that unicodedata.normalize can be used to convert characters with diacritics to their non-diacritic counterparts:

import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')

My question (illustrated by the example above) is: does unicodedata have a way to replace combined characters such as ligatures with their component letters? (u'œ' becomes 'oe')

If not I assume I will have to put a hit out for these, but then I might as well compile my own dict with all uchars and their counterparts and forget about unicodedata altogether...

score 6 · Accepted Answer · answered Sep 12 '12 at 15:57

There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
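To make the distinction concrete, here is a minimal sketch of what NFD does to a precomposed character and what it leaves alone. The variable names are only illustrative, and Python 3 is assumed (the u'' prefix is accepted there too).

```python
import unicodedata

s = u'\u00e9'  # é stored as a single precomposed code point
decomposed = unicodedata.normalize('NFD', s)

# NFD splits it into the base letter followed by a combining mark.
names = [unicodedata.name(c) for c in decomposed]

# Dropping category 'Mn' (nonspacing marks) leaves only the base letter.
stripped = u''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# œ has no canonical decomposition, so NFD returns it unchanged.
unchanged = unicodedata.normalize('NFD', u'\u0153')
```

Here names is ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT'], while unchanged is still the single character œ, which is why the filter in the question has nothing to remove.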

Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:

>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'

The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:

import re
import unicodedata

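# Anchored with $ so that names with extra words (e.g. 'LATIN SMALL LIGATURE LONG S T') are left alone.
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')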

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters. 
    """
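    def untie(l):
        # Unnamed characters (e.g. control characters) make
        # unicodedata.name() raise ValueError unless a default is given.
        m = _ligature_re.match(unicodedata.name(l, ''))
        if not m:
            return l
        elif m.group(1):
            return m.group(2)
        else:
            return m.group(2).lower()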
    return ''.join(untie(l) for l in s)

>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'

(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
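A minimal sketch of that lookup-table approach might look like the following. It assumes Python 3 (chr and str.translate with a dict keyed by code point). The names build_ligature_table and LIGATURES are hypothetical, and the hand-added entries for Æ, æ and ß are my own assumption about what you might want, since their Unicode names do not contain LIGATURE.

```python
import re
import sys
import unicodedata

# Anchored with $ so that names with trailing words are skipped.
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})$')

def build_ligature_table():
    """Scan every code point once and map each Latin ligature to its letters."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        # The '' default avoids ValueError for unnamed code points.
        m = _ligature_re.match(unicodedata.name(chr(cp), ''))
        if m:
            letters = m.group(2)
            table[cp] = letters if m.group(1) else letters.lower()
    return table

# Built once at import time; the scan covers about 1.1 million code points.
LIGATURES = build_ligature_table()

# Hand-picked additions (an assumption, not derived from the names):
# these are named "LETTER", so the regex above cannot find them.
LIGATURES.update({0xC6: 'AE', 0xE6: 'ae', 0xDF: 'ss'})

def split_ligatures(s):
    """Replace every character found in LIGATURES with its component letters."""
    return s.translate(LIGATURES)
```

With the table in place, split_ligatures(u'B\u0153uf') gives 'Boeuf' without touching the Unicode database on each call, and any further special cases only need another entry in the dict.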

WARNING: The approach based on the Unicode name containing "LIGATURE" is not robust. It appears that some ligatures don't have "LIGATURE" in their name string. For example, unicodedata.name(u'\xc6') -> 'LATIN CAPITAL LETTER AE'. — Scott H, Jan 09 '15 at 16:53
There's also ß (U+00DF), which is called "LATIN SMALL LETTER SHARP S" but can be thought of as a double-S ligature. — celticminstrel, Jan 20 '16 at 03:03
@GarethRees: Keep your answer, it's useful. By my count, unicodedata has over 500 code points with ligature in the name (based on ftp://ftp.unicode.org/Public/5.2.0/ucd/NamesList.txt), though many of those are for other languages. I just mentioned my warning to let people know there are some corner cases. — Scott H, Jan 22 '16 at 22:34
