1

How do I convert between Cl\u00e9s and Cle\u0301s for Clés in Python 2.7.10?

Luc Teyssier

2 Answers

4

The unicodedata.normalize function converts Unicode strings to fully composed or fully decomposed forms.

>>> import unicodedata as ud
>>> d = u'Cle\u0301s'
>>> c = u'Cl\u00e9s'
>>> ud.normalize('NFC',c) # no change, already composed form
u'Cl\xe9s'                # Note: repr uses the shortest escape it can, so U+00E9 displays as \xe9.
>>> ud.normalize('NFC',d) # changes to composed form
u'Cl\xe9s'
>>> ud.normalize('NFD',c) # changes to decomposed form
u'Cle\u0301s'
>>> ud.normalize('NFD',d) # no change, already decomposed form
u'Cle\u0301s'
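As a small sketch of why this matters (the variable names are only illustrative): the two forms render identically but have different lengths and compare unequal until both are normalized to the same form.

```python
import unicodedata

composed = u'Cl\u00e9s'      # e-acute as a single code point (U+00E9)
decomposed = u'Cle\u0301s'   # 'e' followed by a combining acute accent (U+0301)

print(len(composed))     # 4
print(len(decomposed))   # 5

# Without normalization the strings differ, even though they look the same.
print(composed == decomposed)   # False

# After normalizing to a common form, they compare equal.
print(unicodedata.normalize('NFC', decomposed) == composed)   # True
print(unicodedata.normalize('NFD', composed) == decomposed)   # True
```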

If you are starting with byte strings that contain literal \uXXXX escape sequences, decode them to Unicode strings first:

>>> db = 'Cle\u0301s'
>>> cb = 'Cl\u00e9s'
>>> d = db.decode('unicode_escape')
>>> c = cb.decode('unicode_escape')
>>> d
u'Cle\u0301s'
>>> c
u'Cl\xe9s'
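The two steps can be combined into one helper. This is only a sketch: `decode_and_normalize` is a made-up name, and it assumes the input is an ASCII-only byte string with literal backslash escapes. The `unicode_escape` codec treats any non-ASCII bytes as Latin-1, so it is not suitable for data that is already UTF-8 encoded.

```python
import unicodedata

def decode_and_normalize(raw, form='NFC'):
    # Hypothetical helper: 'raw' is assumed to be an ASCII-only byte string
    # containing literal backslash-u escapes, e.g. the 9 bytes of Cl\u00e9s.
    text = raw.decode('unicode_escape')        # interpret the escapes
    return unicodedata.normalize(form, text)   # compose or decompose

# The doubled backslash keeps the escape literal inside the byte string.
composed = decode_and_normalize(b'Cle\\u0301s')                # NFC by default
decomposed = decode_and_normalize(b'Cl\\u00e9s', form='NFD')

print(len(composed))     # 4
print(len(decomposed))   # 5
```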
Mark Tolonen
  • `def normalize(v1, uni, v2): print uni; print v2; print unicodedata.normalize('NFC', v2)` I had tried this, but I was getting this error: `TypeError: must be unicode, not str`. If I try `unicodedata.normalize('NFC', unicode(v2))` instead, I get the same values back: Mock-Frederic Associés. Mock-Frederic Associ\u00e9s. Mock-Frederic Associe\U0301s – Luc Teyssier Jan 28 '19 at 19:33
  • In the above example, v1 is Cl\u00e9s, uni is Clés and v2 is Cle\u0301s – Luc Teyssier Jan 28 '19 at 19:41
  • @LucTeyssier the error message is telling you the exact problem. You're not using a Unicode string, you're using a byte string. You need to `decode` that string first. – Mark Ransom Jan 29 '19 at 03:18
  • @MarkRansom, do you mean `print unicodedata.normalize('NFC', v2.decode('utf-8'))`? That isn't changing the result for me. Thanks a lot for your help! – Luc Teyssier Jan 29 '19 at 05:15
  • @LucTeyssier if that isn't eliminating the error message, then the error isn't on the `normalize` line but somewhere else. Just using `unicode(v2)` is not being specific enough about the source encoding. You can't just assume `utf-8`; you need to know the actual encoding of your source data. – Mark Ransom Jan 29 '19 at 05:32
  • v2 is already in Cle\u0301s format, so I don't understand why Python is complaining about requiring unicode, not str. How can I declare v2 to be unicode? If you try an example that passes the variables to a method as mine does, you may get the error I am getting. – Luc Teyssier Jan 29 '19 at 05:36
  • @MarkRansom that was a good call: "the error isn't on the normalize line but somewhere else." Remember I had to edit the question because I had a \U instead of \u. However, because I have variables rather than literal text, I can't write `d = u'Cle\u0301s'` (note the `u` prefix), and the output of `normalize` with NFC is still giving me cle\u0301s instead of cl\u00e9s – Luc Teyssier Jan 29 '19 at 19:23
  • @LucTeyssier does `len()` on that string give you 10 or 5? – Mark Ransom Jan 29 '19 at 20:08
  • couldn't add a comment as "comment too long", thanks for your help! – Luc Teyssier Jan 29 '19 at 21:08
0

Thanks a million @MarkRansom for debugging this with me, got what I was looking for now!

    print uni
    >> Clés
    print v1.lower()
    >> cl\u00e9s
    print v2.lower()
    >> cle\u0301s

    print len(unicodedata.normalize('NFD', v1.lower().decode('UTF-8')))
    >> 9
    print len(unicodedata.normalize('NFC', v2.lower().decode('UTF-8')))
    >> 10

    print len(v1.lower().decode("unicode_escape"))
    >> 4

    print len(v2.lower().decode("unicode_escape"))
    >> 5

    print len(unicodedata.normalize('NFD', v1.lower().decode("unicode_escape")))
    >> 5
    print len(unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
    >> 4

    print (v1.lower().decode("unicode_escape") == unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
    >> True

Obviously, calling lower() or upper() here will not be a good idea for most people, but it works for me because I expect more or less the same word back from two different processes.
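One possible variation, sketched under the same assumption that both values arrive as ASCII-only byte strings with literal escapes: decode and normalize first, then lowercase the resulting Unicode string, so the case change never touches the escape sequences themselves. `comparison_key` is a hypothetical name. On Python 3, `str.casefold()` would be the more thorough choice, but it does not exist in 2.7.

```python
import unicodedata

def comparison_key(raw):
    # Hypothetical helper: decode the literal escapes, normalize to NFC,
    # and only then lowercase the Unicode text for a loose comparison.
    text = raw.decode('unicode_escape')
    return unicodedata.normalize('NFC', text).lower()

v1 = b'Cl\\u00e9s'     # composed form, escaped
v2 = b'CLE\\u0301S'    # decomposed form, escaped, different case

print(comparison_key(v1) == comparison_key(v2))   # True
print(len(comparison_key(v2)))                    # 4
```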

Luc Teyssier
  • So what you actually had was a byte string `'Cl\u00e9s'` (length 9, not a single character escape code) instead of a Unicode string `u'Cl\u00e9s'` (length 4). Are you working with JSON? That's one of the common ways to have strings in that format. – Mark Tolonen Jan 30 '19 at 04:24
  • Glad I was able to help. You might also want to check out [How do I do a case-insensitive string comparison?](https://stackoverflow.com/a/29247821/5987) – Mark Ransom Jan 31 '19 at 16:47
  • Thanks, that link is really useful. I was overlooking these cases with my solution. – Luc Teyssier Feb 02 '19 at 08:07
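If the escaped strings do come from JSON, as the first comment above asks, the `json` module already interprets the \uXXXX escapes while parsing, so only the normalization step would remain. A minimal sketch, assuming a made-up `payload` document:

```python
import json
import unicodedata

# Hypothetical JSON text; the doubled backslash keeps the escape literal here.
payload = '{"name": "Cle\\u0301s"}'

name = json.loads(payload)['name']   # the parser turns the escape into U+0301
print(len(name))                     # 5

composed = unicodedata.normalize('NFC', name)
print(len(composed))                 # 4
print(composed == u'Cl\u00e9s')      # True
```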