How do I convert between Cl\u00e9s and Cle\u0301s for Clés in Python 2.7.10?
-
`\U0301s` isn't a valid character. Please correct the question. – Mark Ransom Jan 28 '19 at 04:37
-
Try putting `# -*- coding: utf8 -*-` at the start of your script, as the first line! – Anwarvic Jan 28 '19 at 05:53
-
@Anwarvic That declares the encoding of the source file only. It has no effect except for processing string constants with non-ASCII characters. – Mark Tolonen Jan 28 '19 at 08:30
-
yeah I know, I thought that was the problem! – Anwarvic Jan 28 '19 at 08:41
2 Answers
4
The `unicodedata.normalize` function converts Unicode strings to fully composed or fully decomposed forms.
>>> import unicodedata as ud
>>> d = u'Cle\u0301s'
>>> c = u'Cl\u00e9s'
>>> ud.normalize('NFC',c) # no change, already composed form
u'Cl\xe9s' # Note: escape codes display with a smaller form if possible.
>>> ud.normalize('NFC',d) # changes to composed form
u'Cl\xe9s'
>>> ud.normalize('NFD',c) # changes to decomposed form
u'Cle\u0301s'
>>> ud.normalize('NFD',d) # no change, already decomposed form
u'Cle\u0301s'
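As a small sketch of why this matters in practice: the two forms compare unequal as they stand, but equal once both are normalized to the same form. The helper name `canonically_equal` is made up for illustration, and `u''` literals are used so the snippet should run on Python 2.7 as well as Python 3.3+.

```python
import unicodedata


def canonically_equal(a, b):
    """Return True if two Unicode strings match after NFC normalization."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


composed = u'Cl\u00e9s'     # 4 code points: e-acute is a single code point
decomposed = u'Cle\u0301s'  # 5 code points: 'e' followed by a combining acute

raw_equal = composed == decomposed                          # compared as-is
normalized_equal = canonically_equal(composed, decomposed)  # compared after NFC
lengths = (len(composed), len(decomposed))
```

Normalizing both sides to NFD instead would work equally well for the comparison; what matters is that both sides use the same form.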
If you are starting with byte strings in that format, the following will convert to Unicode strings first:
>>> db = 'Cle\u0301s'
>>> cb = 'Cl\u00e9s'
>>> d = db.decode('unicode_escape')
>>> c = cb.decode('unicode_escape')
>>> d
u'Cle\u0301s'
>>> c
u'Cl\xe9s'
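The two steps above (decode, then normalize) could be wrapped in one function. This is only a sketch: `escaped_bytes_to_nfc` is a hypothetical name, and it assumes the input bytes are plain ASCII containing literal `\uXXXX` sequences, because the `unicode_escape` codec interprets any non-ASCII byte as Latin-1.

```python
import unicodedata


def escaped_bytes_to_nfc(raw):
    """Decode a byte string holding literal \\uXXXX escapes, then compose it.

    Assumption: `raw` is ASCII-only apart from the escape sequences.
    """
    text = raw.decode('unicode_escape')        # bytes -> Unicode string
    return unicodedata.normalize('NFC', text)  # decomposed -> composed form


# The doubled backslash keeps the escape literal on both Python 2 and 3.
db = b'Cle\\u0301s'  # 10 bytes
cb = b'Cl\\u00e9s'   # 9 bytes

from_decomposed = escaped_bytes_to_nfc(db)
from_composed = escaped_bytes_to_nfc(cb)
```

Both calls end up with the same four-character composed string, so the results can be compared directly.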

Mark Tolonen
-
`def normalize(v1,uni,v2): print uni print v2 print unicodedata.normalize('NFC', v2)` I had tried this, but I was getting this error: `TypeError: must be unicode, not str`. If I try `unicodedata.normalize('NFC', unicode(v2))` I get the same values back: Mock-Frederic Associés, Mock-Frederic Associ\u00e9s, Mock-Frederic Associe\U0301s – Luc Teyssier Jan 28 '19 at 19:33
-
v1 being Cl\u00e9s, uni being Clés, and v2 being Cle\u0301s in the above example – Luc Teyssier Jan 28 '19 at 19:41
-
@LucTeyssier the error message is telling you the exact problem. You're not using a Unicode string, you're using a byte string. You need to `decode` that string first. – Mark Ransom Jan 29 '19 at 03:18
-
@MarkRansom, do you mean `print unicodedata.normalize('NFC', v2.decode('utf-8'))`? That isn't changing the result for me. Thanks a lot for your help! – Luc Teyssier Jan 29 '19 at 05:15
-
@LucTeyssier if that isn't eliminating the error message, then the error isn't on the `normalize` line but somewhere else. Just using `unicode(v2)` is not specific enough about the source encoding. You can't just assume `utf-8`; you need to know the actual encoding of your source data. – Mark Ransom Jan 29 '19 at 05:32
-
v2 already is in Cle\u0301s format, so I don't understand why Python is complaining about requiring unicode, not str. How can I declare v2 to be unicode? If you could try an example of passing the variables to a method as in my example, you may get the error I am getting. – Luc Teyssier Jan 29 '19 at 05:36
-
@MarkRansom that was a good call: "the error isn't on the normalize line but somewhere else." Remember I had to edit the question because I had a \U instead of \u. However, because I have variables rather than literal text, I can't write `d = u'Cle\u0301s'` (note the `u` prefix at the beginning), and the output of `normalize` with NFC is still cle\u0301s instead of cl\u00e9s – Luc Teyssier Jan 29 '19 at 19:23
-
couldn't add a comment as "comment too long", thanks for your help! – Luc Teyssier Jan 29 '19 at 21:08
0
Thanks a million @MarkRansom for debugging this with me, got what I was looking for now!
print uni
>> Clés
print v1.lower()
>> cl\u00e9s
print v2.lower()
>> cle\u0301s
print len(unicodedata.normalize('NFD', v1.lower().decode('UTF-8')))
>> 9
print len(unicodedata.normalize('NFC', v2.lower().decode('UTF-8')))
>> 10
print len(v1.lower().decode("unicode_escape"))
>> 4
print len(v2.lower().decode("unicode_escape"))
>> 5
print len(unicodedata.normalize('NFD', v1.lower().decode("unicode_escape")))
>> 5
print len(unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
>> 4
print len(v1.lower().decode("unicode_escape"))
>> 4
print (v1.lower().decode("unicode_escape") == unicodedata.normalize('NFC', v2.lower().decode("unicode_escape")))
>> True
Obviously, calling `lower()` or `upper()` here will not be a good idea for most people, but it works for me because I expect more or less the same word back from two different processes.
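To spell out why the `'UTF-8'` attempts above report 9 and 10: decoding as UTF-8 leaves the backslash, the `u` and the hex digits as ordinary characters, so `normalize` has nothing to combine. Only `unicode_escape` turns the sequences into real code points. A minimal sketch, assuming `v1` and `v2` hold literal escape sequences as in the output above (`same_word` is a hypothetical helper name):

```python
import unicodedata

# Byte strings containing a literal backslash, as in the printed values above.
v1 = b'cl\\u00e9s'   # 9 bytes
v2 = b'cle\\u0301s'  # 10 bytes

# UTF-8 decoding keeps every escape character as-is.
utf8_lengths = (len(v1.decode('utf-8')), len(v2.decode('utf-8')))

# unicode_escape interprets \uXXXX as a single code point.
escape_lengths = (len(v1.decode('unicode_escape')),
                  len(v2.decode('unicode_escape')))


def same_word(a, b):
    """Compare two escaped byte strings after decoding and NFC normalization."""
    ua = unicodedata.normalize('NFC', a.decode('unicode_escape'))
    ub = unicodedata.normalize('NFC', b.decode('unicode_escape'))
    return ua == ub


match = same_word(v1, v2)
```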

Luc Teyssier
-
So what you actually had was a byte string `'Cl\u00e9s'` (length 9, not a single character escape code) instead of a Unicode string `u'Cl\u00e9s'` (length 4). Are you working with JSON? That's one of the common ways to have strings in that format. – Mark Tolonen Jan 30 '19 at 04:24
-
Glad I was able to help. You might also want to check out [How do I do a case-insensitive string comparison?](https://stackoverflow.com/a/29247821/5987) – Mark Ransom Jan 31 '19 at 16:47
-
Thanks, that link is really useful. I was overlooking these cases with my solution. – Luc Teyssier Feb 02 '19 at 08:07
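If the strings did come from JSON, as the comment above speculates (the thread never confirms this), the standard `json` module already decodes `\uXXXX` escapes, so only the normalization step would remain. A hedged sketch, with `nfc_from_json` as a made-up helper name:

```python
import json
import unicodedata


def nfc_from_json(document):
    """Parse a JSON string value and return it in composed (NFC) form."""
    return unicodedata.normalize('NFC', json.loads(document))


# Each argument is a JSON document holding one string with a literal escape.
composed = nfc_from_json('"Cl\\u00e9s"')
decomposed = nfc_from_json('"Cle\\u0301s"')

same = composed == decomposed
```

This only applies when the whole value is valid JSON; for bare escaped text without the surrounding quotes, the `unicode_escape` approach from the answers above is the one to use.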