Comparing special characters in Python

Question

I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. The string that I copied from the source and pasted into my Python script does not compare equal to the same string when I read it from the file in my script. Printing both strings gives me 'Opérations'. However, when I encode them to UTF-8 I notice the difference.

b'Ope\xcc\x81rations'
b'Op\xc3\xa9rations'

My question is: what do I do to ensure that the special character in my Python script is the same as the one in the file's content when comparing such strings?

Which version of python do you use? You tagged both python-3x and python-2.7. If using python2 and "place" the string in the script, what encoding do you specify for the source code? — buran, Dec 19 '18 at 11:19
Possible duplicate of [Match accentuated strings in lists of string in Python 3](https://stackoverflow.com/questions/52994408/match-accentuated-strings-in-lists-of-string-in-python-3) — Jongware, Dec 19 '18 at 11:19

Kian · Answer 1 · 2018-12-19T16:02:57.527

Good to know:

You are dealing with two types of strings: byte strings and unicode strings. Each has a method to convert it to the other type. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. That is:

unicode.encode() ----> bytes

and

bytes.decode() -----> unicode

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point: the higher the code point value, the more bytes it needs in UTF-8.
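As a small illustration of the two conversions and of UTF-8's variable width, here is a sketch for Python 3, where str is the unicode type. The sample characters are picked only for the example and are not from the question.

```python
# Python 3: str is the unicode string type, bytes is the byte string type.
text = "Op\u00e9rations"            # é written as the single code point U+00E9

encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

print(encoded)                      # b'Op\xc3\xa9rations'
print(decoded == text)              # True

# UTF-8 spends more bytes on higher code points:
# 'e' (U+0065), 'é' (U+00E9), '€' (U+20AC), and an emoji (U+1F600).
samples = ["e", "\u00e9", "\u20ac", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)                       # [1, 2, 3, 4]
```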

Get to the point:

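If you define your string both as byte strings and as unicode strings, as follows:

a_byte = b'Ope\xcc\x81rations'      # UTF-8 bytes: 'e' + combining acute accent
a_unicode = u'Ope\u0301rations'     # same text as code points ('e' + U+0301)

and

b_byte = b'Op\xc3\xa9rations'       # UTF-8 bytes: precomposed 'é'
b_unicode = u'Op\xe9rations'        # same text as code points (U+00E9)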

you will see:

print('a_byte length is: %d' % len(a_byte.decode("utf-8")))
# print('a_unicode length is: %d' % len(a_unicode.encode("utf-8")))

print('b_byte length is: %d' % len(b_byte.decode("utf-8")))
# print('b_unicode length is: %d' % len(b_unicode.encode("utf-8")))

output:

a_byte length is: 11
b_byte length is: 10

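So you see they are not the same: the first string spells the accented letter as a plain 'e' followed by a combining acute accent (two code points), while the second uses the single precomposed 'é'.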
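To see where the extra code point comes from, one option is to list the Unicode names of the non-ASCII characters in each decoded string. This is only a sketch for Python 3; the helper name describe is made up for the example.

```python
import unicodedata

# The two byte strings from the question, decoded to str.
a = b'Ope\xcc\x81rations'.decode('utf-8')
b = b'Op\xc3\xa9rations'.decode('utf-8')

def describe(s):
    """Return the Unicode name of every non-ASCII code point in s (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in s if ord(ch) > 127]

print(len(a), describe(a))  # 11 ['COMBINING ACUTE ACCENT']
print(len(b), describe(b))  # 10 ['LATIN SMALL LETTER E WITH ACUTE']
print(a == b)               # False
```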

My solution:

If you don't want to be confused, use repr(). In Python 2 on a UTF-8 terminal, print a_byte and print b_byte both display Opérations, but:

print('%r, %r' % (a_byte, b_byte))

will print:

'Ope\xcc\x81rations', 'Op\xc3\xa9rations'

(In Python 3 each literal is shown with a b prefix.)

You can also normalize the unicode strings before comparison, as in @Daniel's answer:

from unicodedata import normalize
from functools import partial

a_byte = b'Ope\xcc\x81rations'             # decomposed form, as UTF-8 bytes
norm = partial(normalize, 'NFC')           # NFC composes 'e' + accent into 'é'
your_string = norm(a_byte.decode('utf8'))  # now equal to u'Op\xe9rations'
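Applied to the original question, the comparison could look roughly like the sketch below (Python 3). The function name same_text and the two sample variables are assumptions for illustration; the key point is that both sides are normalized to the same form before comparing.

```python
import unicodedata

def same_text(left, right, form="NFC"):
    """Compare two str values after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)

from_script = "Op\u00e9rations"   # precomposed é, as pasted into the script
from_file = "Ope\u0301rations"    # 'e' + combining accent, as read from the file

print(from_script == from_file)            # False
print(same_text(from_script, from_file))   # True
```

If the file is opened in text mode with an explicit encoding such as open(path, encoding="utf-8"), each line is already a str and can be passed to the helper directly.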
