I have a string whose value is 'Opérations'. My script reads a file and does some comparisons. The string that I copied from that source and pasted into my Python script does not compare equal to the same string as read from the file. Printing both strings gives 'Opérations'. However, when I encode them to UTF-8 I notice the difference:

  • b'Ope\xcc\x81rations'
  • b'Op\xc3\xa9rations'

My question is: what do I do to ensure that the special character in my Python script is the same as the one in the file's content when comparing such strings?

A.H
  • 2
    Which version of python do you use? You tagged both python-3x and python-2.7. If using python2 and "place" the string in the script, what encoding do you specify for the source code? – buran Dec 19 '18 at 11:19
  • 1
    Possible duplicate of [Match accentuated strings in lists of string in Python 3](https://stackoverflow.com/questions/52994408/match-accentuated-strings-in-lists-of-string-in-python-3) – Jongware Dec 19 '18 at 11:19

1 Answer

Good to know:

You are dealing with two types of strings: byte strings and unicode strings. Each has a method to convert it to the other type. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. That is:

unicode.encode() ----> bytes

and

bytes.decode() -----> unicode

UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes (one to four) for each code point. The higher the code point value, the more bytes it needs in UTF-8.
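As a rough illustration of the encode/decode round trip and of the variable byte width, here is a minimal Python 3 sketch (in Python 3, str is the unicode type). The helper name utf8_len and the sample characters are my own choices for illustration, not something from the question.

```python
# Python 3: str holds Unicode text, bytes holds encoded data.
text = "Op\u00e9rations"            # 'é' written as the single code point U+00E9

encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str


def utf8_len(ch):
    """Return how many bytes UTF-8 needs for one character (illustrative helper)."""
    return len(ch.encode("utf-8"))


# Sample characters with increasing code point values:
# U+0065 'e', U+00E9 'é', U+20AC euro sign, U+1F600 (an emoji).
samples = ["e", "\u00e9", "\u20ac", "\U0001F600"]
sizes = [utf8_len(ch) for ch in samples]

print(encoded)
print(sizes)
```

The round trip gives back the original text, and the byte count grows with the code point value.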

Get to the point:

If you define each of your strings both as a byte string and as a unicode string, as follows:

a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\u0301rations'  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

and

b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xe9rations'  # single code point U+00E9

you will see:

print 'a_byte length is: ', len(a_byte.decode("utf-8"))
#print 'a_unicode length is: ', len(a_unicode.encode("utf-8"))

print 'b_byte length is: ', len(b_byte.decode("utf-8"))
#print 'b_unicode length is: ', len(b_unicode.encode("utf-8"))

output:

a_byte length is:  11
b_byte length is:  10

So they are not the same: once decoded, the first string has 11 code points and the second has 10.
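To see where the extra code point comes from, one option is to list the Unicode name of every character in each decoded string. This is a small Python 3 sketch; describe is a hypothetical helper name I made up for the example.

```python
import unicodedata

# The two byte strings from the question, decoded to text.
a = b'Ope\xcc\x81rations'.decode('utf-8')
b = b'Op\xc3\xa9rations'.decode('utf-8')


def describe(s):
    """List each character as 'U+XXXX NAME' (illustrative helper)."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch)) for ch in s]


a_names = describe(a)
b_names = describe(b)

print(a_names[2:4])   # plain 'e' followed by a combining accent
print(b_names[2:3])   # one precomposed character
```

The first string spells é as a plain e plus a combining acute accent, while the second uses the single precomposed character, which is why they render identically but compare unequal.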

My solution:

To avoid confusion, use repr(). Both print a_byte and print b_byte display Opérations, but:

print repr(a_byte),repr(b_byte)

prints:

'Ope\xcc\x81rations' 'Op\xc3\xa9rations'

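You can also normalize the unicode strings before comparison, as in @Daniel's answer. Apply the same normalization to both sides of the comparison:

from unicodedata import normalize
from functools import partial
a_byte = b'Ope\xcc\x81rations'  # the UTF-8 bytes of the decomposed form
norm = partial(normalize, 'NFC')
# decode the bytes to unicode first, then normalize to the composed form (NFC)
your_string = norm(a_byte.decode('utf8'))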
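For completeness, here is how the comparison itself might look in Python 3, where the decoding step happens when the file is opened in text mode (for example with open(path, encoding='utf-8')). This is only a sketch: same_text is a hypothetical helper, and the two literals stand in for the string in the script and the string read from the file.

```python
import unicodedata


def same_text(left, right, form="NFC"):
    """Compare two str values after normalizing both to the same form (illustrative helper)."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


from_script = "Op\u00e9rations"    # precomposed é, as typed in the script
from_file = "Ope\u0301rations"     # e + combining accent, as found in the file

print(from_script == from_file)          # plain comparison
print(same_text(from_script, from_file)) # comparison after normalization
```

Either NFC or NFD works for this purpose, as long as both sides use the same form.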
Kian