Identical looking string but different bytes representation

Question

The upper string is typed by me while the bottom string is pulled from a database.

bytes('TOYOTA', 'utf-8')
>> b'TOYOTA'

bytes('ΤΟΥΟΤΑ', 'utf-8')
>> b'\xce\xa4\xce\x9f\xce\xa5\xce\x9f\xce\xa4\xce\x91'

This causes undesirable results when I check whether the string exists:

'TOYOTA' == 'ΤΟΥΟΤΑ'
>> False

Any idea how to "fix" the incorrect string?

These are NOT identical strings. They only **look** similar. Search for one of them, e.g. using your browser's search, and you won't find both. — Marcin Orlowski, Aug 28 '20 at 15:06
@MarcinOrlowski Depending on the font, or application, they actually look identical. In my web browser they look identical, while in the terminal they look quite different. — mkrieger1, Aug 28 '20 at 15:08
Which is the "incorrect" string? If the actual content of the database is Greek text, surely you don't want to replace the letters and corrupt the data, right? So the only problem here is the search query. `TOYOTA` *is not* `ΤΟΥΟΤΑ` no matter how similar they look. — trent, Aug 28 '20 at 21:46

score 2 · Accepted Answer · answered Aug 28 '20 at 15:01

It appears those are Greek capital letters:

>>> import unicodedata
>>> s = 'ΤΟΥΟΤΑ'
>>> for c in s:
...     print(unicodedata.name(c))
... 
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER UPSILON
GREEK CAPITAL LETTER OMICRON
GREEK CAPITAL LETTER TAU
GREEK CAPITAL LETTER ALPHA
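
The same lookup can be wrapped in a small helper that reports which scripts a string uses. This is only a sketch: `scripts_used` is a hypothetical name, and it relies on the assumption that the first word of a letter's Unicode name (`LATIN`, `GREEK`, `CYRILLIC`, ...) identifies its script, which holds for the letters shown here but is not a full script detector.

```python
import unicodedata

def scripts_used(text):
    """Return the first word of each letter's Unicode name, as a set."""
    # unicodedata.name() raises ValueError for unnamed characters,
    # so a default of 'UNKNOWN' is supplied.
    return {
        unicodedata.name(ch, 'UNKNOWN').split()[0]
        for ch in text
        if ch.isalpha()
    }

typed = 'TOYOTA'                                # Latin capitals
from_db = '\u03a4\u039f\u03a5\u039f\u03a4\u0391'  # the Greek capitals listed above

print(scripts_used(typed))    # {'LATIN'}
print(scripts_used(from_db))  # {'GREEK'}
```

A result other than `{'LATIN'}` for a value you expected to be plain ASCII is a hint that it contains lookalike characters.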

You could try one of the available third-party libraries to transliterate the text to the Latin alphabet.

This is a similar question: How can I create a string in english letters from another language word?
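
If a full transliteration library is more than you need for a simple existence check, a hand-written lookalike table with `str.translate` may be enough. The sketch below is an assumption-laden illustration, not a general solution: `GREEK_TO_LATIN` and `fold_lookalikes` are hypothetical names, and the table covers only the four Greek capitals that appear in this question.

```python
import unicodedata

# Hypothetical and deliberately incomplete: only the Greek capitals
# from this question, keyed by their Unicode names to avoid typing
# lookalike characters by mistake.
GREEK_TO_LATIN = {
    'GREEK CAPITAL LETTER TAU': 'T',
    'GREEK CAPITAL LETTER OMICRON': 'O',
    'GREEK CAPITAL LETTER UPSILON': 'Y',
    'GREEK CAPITAL LETTER ALPHA': 'A',
}

# Build a translation table mapping each Greek character to its Latin lookalike.
LOOKALIKES = str.maketrans(
    {unicodedata.lookup(name): latin for name, latin in GREEK_TO_LATIN.items()}
)

def fold_lookalikes(text):
    """Replace the known Greek lookalikes with Latin letters."""
    return text.translate(LOOKALIKES)

from_db = '\u03a4\u039f\u03a5\u039f\u03a4\u0391'  # Greek capitals from the database
print(fold_lookalikes(from_db) == 'TOYOTA')       # True
```

Apply the folding only to the value used for comparison, not to the stored data: if the database really contains Greek text, rewriting it would corrupt it.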

I just need a simple existence check and this solution works. Now I know I can check the unicode name of each character in python. Thanks! — Zhi Qin Tan, Aug 29 '20 at 16:54
