
I have 2 strings:

s1 = "CATS"
s2 = "САTS"

Although they look identical, they are not: comparing them in Python (or any other language) yields False.

If I encode them as UTF-8 in Python:

s1 = s1.encode('utf-8')
s2 = s2.encode('utf-8')

and then print them:

print(s1)
print(s2)

the result is

b'CATS'
b'\xd0\xa1\xd0\x90TS'

When I compare these two strings, I need s1 == s2 to evaluate to True. How can I achieve that? Many thanks for any workaround.

  • How did you get these two strings? Obviously not from simple assignment statements. The second string shows two 2-byte characters, followed by T and S. – Tarik Dec 04 '20 at 16:41
  • The second string apparently uses Cyrillic `A` and `C`, and the first one does not. – Wisa Dec 04 '20 at 16:44
  • You may be interested in something like [Bi-directional transliterator for Python](https://pypi.org/project/transliterate/)? – JosefZ Dec 04 '20 at 16:46
  • I don't think the question is about normalizing the text. There was an assumption made that these two strings are the same and these two strings have been proven *not* to be the same. Normalizing the string would overlook this error and the OP hasn't clearly conveyed to us what the true objective is. If it *really* is shoehorning these two strings which aren't equal into being equal, then I would happily stand corrected. – Makoto Dec 04 '20 at 16:57
  • Well, yes, I didn't get them using a standard method. These two are part of a huge output from the Google Vision API OCR method. In my code I often use it to extract text from images that change slightly over time, and in many cases a string comes out as my s1 in one run and as s2 in the next. My task is to compare them with previous outputs to track changes, and this inconsistency makes that a disaster. – Lukáš Maliar Dec 04 '20 at 17:29

1 Answer


It looks like they are different characters with different Unicode values.

>>> s1 = "CATS"
>>> s2 = "САTS"
>>> ord(s1[0])  # Latin Capital Letter C
67
>>> ord(s2[0])  # Cyrillic Capital Letter Es
1057
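
If the goal really is to treat such lookalikes as equal, one possible workaround is to fold the Cyrillic letters onto their Latin counterparts before comparing. The following is only a minimal sketch: the table `CYRILLIC_TO_LATIN` and the helper `fold_lookalikes` are hypothetical names introduced here, the table is hand-written and far from complete, and it assumes the text is expected to be Latin-only.

```python
import unicodedata

# Hypothetical, deliberately small lookalike table (an assumption, not a
# standard mapping): Cyrillic capitals that resemble Latin capitals.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0412": "B",  # CYRILLIC CAPITAL LETTER VE
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u041a": "K",  # CYRILLIC CAPITAL LETTER KA
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u041d": "H",  # CYRILLIC CAPITAL LETTER EN
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0425": "X",  # CYRILLIC CAPITAL LETTER HA
})


def fold_lookalikes(text):
    """Replace the Cyrillic capitals listed above with Latin lookalikes."""
    return text.translate(CYRILLIC_TO_LATIN)


s1 = "CATS"
s2 = "\u0421\u0410TS"  # Cyrillic Es, Cyrillic A, then Latin T and S

# Inspect which characters are actually in the second string.
print([unicodedata.name(ch) for ch in s2])

print(s1 == s2)                                    # raw comparison
print(fold_lookalikes(s1) == fold_lookalikes(s2))  # comparison after folding
```

Note that this folding would corrupt text that is genuinely Cyrillic, so it only makes sense when every string being compared is known to be Latin-script.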
snakecharmerb
Diptangsu Goswami