Converting between Unicode forms with decomposition and composition in Python

Question

How do I convert between Cl\u00e9s and Cle\u0301s (both representing Clés) in Python 2.7.10?

`\U0301s` isn't a valid character. Please correct the question. — Mark Ransom, Jan 28 '19 at 04:37
try putting `# -*- coding: utf8 -*-` at the start of your script..as a first line!! — Anwarvic, Jan 28 '19 at 05:53
@Anwarvic That declares the encoding of the source file only. It has no effect except for processing string constants with non-ASCII characters. — Mark Tolonen, Jan 28 '19 at 08:30

Mark Tolonen · Answer 1 · 2019-01-30T04:18:50.597

score 4

The unicodedata.normalize function converts Unicode strings to fully composed or fully decomposed forms.

>>> import unicodedata as ud
>>> d = u'Cle\u0301s'
>>> c = u'Cl\u00e9s'
>>> ud.normalize('NFC',c) # no change, already composed form
u'Cl\xe9s'                # Note: escape codes display with a smaller form if possible.
>>> ud.normalize('NFC',d) # changes to composed form
u'Cl\xe9s'
>>> ud.normalize('NFD',c) # changes to decomposed form
u'Cle\u0301s'
>>> ud.normalize('NFD',d) # no change, already decomposed form
u'Cle\u0301s'
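As a minimal sketch of how the session above might be used in a script, the helper below compares two strings after normalizing both to the same form. The name `same_text` is hypothetical (it is not part of `unicodedata`), and the code is written so that it should also run on Python 3.

```python
from __future__ import print_function
import unicodedata

def same_text(a, b, form='NFC'):
    # Hypothetical helper: two Unicode strings count as equal when their
    # normalized forms match, whether composed or decomposed on input.
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = u'Cl\u00e9s'     # e-acute as one code point
decomposed = u'Cle\u0301s'  # 'e' followed by a combining acute accent

print(len(composed), len(decomposed))   # 4 5
print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: same text once normalized
```

Either form works for the comparison, as long as both sides use the same one.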

If you are starting with byte strings that contain literal \uXXXX escape sequences (as in the question), decode them to Unicode strings first:

>>> db = 'Cle\u0301s'
>>> cb = 'Cl\u00e9s'
>>> d = db.decode('unicode_escape')
>>> c = cb.decode('unicode_escape')
>>> d
u'Cle\u0301s'
>>> c
u'Cl\xe9s'
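The two steps can be combined. The sketch below assumes the input is an ASCII-only byte string holding literal `\uXXXX` escapes; `unescape_and_normalize` is a hypothetical name. The `unicode_escape` codec treats any non-ASCII bytes as Latin-1, so this is not suitable for data that is actually UTF-8 encoded.

```python
from __future__ import print_function
import unicodedata

def unescape_and_normalize(raw, form='NFC'):
    # Hypothetical helper. Assumes `raw` is an ASCII byte string that
    # contains literal backslash-u escapes, not real accented characters.
    text = raw.decode('unicode_escape')        # escapes become code points
    return unicodedata.normalize(form, text)   # then pick one canonical form

raw_decomposed = b'Cle\\u0301s'  # 10 bytes: the escape is six separate bytes
raw_composed = b'Cl\\u00e9s'     # 9 bytes

print(len(raw_decomposed), len(raw_composed))     # 10 9
print(len(unescape_and_normalize(raw_decomposed)))  # 4
print(unescape_and_normalize(raw_decomposed) ==
      unescape_and_normalize(raw_composed))       # True
```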

edited Jan 30 '19 at 04:18

answered Jan 28 '19 at 08:33


`def normalize(v1,uni,v2): print uni print v2 print unicodedata.normalize('NFC', v2)` I had tried this, but I was getting this error: `TypeError: must be unicode, not str`. If I try `unicodedata.normalize('NFC', unicode(v2))` I get the same values back: Mock-Frederic Associés. Mock-Frederic Associ\u00e9s. Mock-Frederic Associe\U0301s – Luc Teyssier Jan 28 '19 at 19:33
v1 being Cl\u00e9s, uni being Clés and v2 being Cle\u0301s in the above example – Luc Teyssier Jan 28 '19 at 19:41
@LucTeyssier the error message is telling you the exact problem. You're not using a Unicode string, you're using a byte string. You need to `decode` that string first. – Mark Ransom Jan 29 '19 at 03:18
@MarkRansom, do you mean `print unicodedata.normalize('NFC', v2.decode('utf-8'))`? That isn't changing the result for me. Thanks a lot for your help! – Luc Teyssier Jan 29 '19 at 05:15
@LucTeyssier if that isn't eliminating the error message then the error isn't on the `normalize` line but somewhere else. Just using `unicode(v2)` is not being specific enough about the source encoding. You can't just assume `utf-8`; you need to know the actual encoding of your source data. – Mark Ransom Jan 29 '19 at 05:32
v2 already is in Cle\u0301s format so I don't understand why python is complaining about requiring unicode not str. How can I declare v2 to be unicode? If you could try an example of passing the variables to a method as my example, you may get the error I am getting. – Luc Teyssier Jan 29 '19 at 05:36
@MarkRansom that was a good call: "if that isn't eliminating the error message then the error isn't on the normalize line but somewhere else." Remember I had to edit the question because I had a \U instead of \u. However, because I have variables rather than literal text, I can't write d = u'Cle\u0301s' (note the u prefix at the beginning), and the output of normalize NFC is still giving me cle\u0301s instead of cl\u00e9s – Luc Teyssier Jan 29 '19 at 19:23
@LucTeyssier does `len()` on that string give you 10 or 5? – Mark Ransom Jan 29 '19 at 20:08
couldn't add a comment as "comment too long", thanks for your help! – Luc Teyssier Jan 29 '19 at 21:08

score 0 · Answer 2 · answered Jan 29 '19 at 21:07

Thanks a million @MarkRansom for debugging this with me, got what I was looking for now!

    print uni
    >> Clés
    print v1.lower()
    >> cl\u00e9s
    print v2.lower()
    >> cle\u0301s

    print len(unicodedata.normalize('NFD', v1.lower().decode('UTF-8')))
    >> 9
    print len(unicodedata.normalize('NFC', v2.lower().decode('UTF-8')))
    >> 10

    print len(v1.lower().decode("unicode_escape"))
    >> 4

    print len(v2.lower().decode("unicode_escape"))
    >> 5

    print len(unicodedata.normalize('NFD', v1.lower().decode("unicode_escape")))
    >> 5
    print len(unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
    >> 4

    print len(v1.lower().decode("unicode_escape"))
    >> 4

    print (v1.lower().decode("unicode_escape") == unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
    >> True

Obviously lower() and upper() here will not be a good idea for most, but for me this works as I am expecting more or less the same word back from two different processes.
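One possible variation on the comparison above, offered as a sketch rather than as the method used in this answer: decode and normalize first, and lowercase last. Calling `lower()` on the escaped byte string only lowercases the hex digits inside an escape, not the character the escape stands for. The helper names `canonical` and `same_word` are hypothetical.

```python
from __future__ import print_function
import unicodedata

def canonical(raw):
    # Hypothetical helper. Assumes `raw` is an ASCII byte string with
    # literal backslash-u escapes. Decode, compose, then lowercase.
    text = raw.decode('unicode_escape')
    return unicodedata.normalize('NFC', text).lower()

def same_word(raw_a, raw_b):
    # Equal if the two inputs match after decoding, NFC and lowercasing.
    return canonical(raw_a) == canonical(raw_b)

print(same_word(b'Cl\\u00e9s', b'CLE\\u0301S'))  # True

# Lowercasing before decoding leaves an escaped capital untouched:
early = b'CL\\u00C9S'.lower().decode('unicode_escape')
print(early == u'cl\u00c9s')                     # True: still a capital E-acute
print(canonical(b'CL\\u00C9S') == u'cl\u00e9s')  # True: lowered after decoding
```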

So what you actually had was a byte string `'Cl\u00e9s'` (length 9, not a single character escape code) instead of a Unicode string `u'Cl\u00e9s'` (length 4). Are you working with JSON? That's one of the common ways to have strings in that format. — Mark Tolonen, Jan 30 '19 at 04:24
Glad I was able to help. You might also want to check out [How do I do a case-insensitive string comparison?](https://stackoverflow.com/a/29247821/5987) — Mark Ransom, Jan 31 '19 at 16:47
Thanks, that link is really useful. I was overlooking these cases with my solution. — Luc Teyssier, Feb 02 '19 at 08:07
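On the JSON question raised in the comments: if the escaped strings do come from JSON (an assumption, since the thread never confirms it), the `json` module already turns `\uXXXX` escapes into Unicode characters, so only the normalization step would remain. The `payload` value below is made up for illustration.

```python
from __future__ import print_function
import json
import unicodedata

# Made-up JSON document whose string value holds a literal \u0301 escape.
payload = '{"name": "Cle\\u0301s"}'

name = json.loads(payload)['name']   # the JSON parser resolves the escape
print(len(name))                     # 5: still decomposed

composed = unicodedata.normalize('NFC', name)
print(len(composed))                 # 4
print(composed == u'Cl\u00e9s')      # True
```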
