Some characters look almost the same but have different code points. How do I know which one the user has typed? For example, if I want to check whether the user has entered 'é', should I test it against the UTF-8 bytes c3 a9 or 65 cc 81? If I check both, how do I know there aren't other possibilities?
What I get in Python:
>>> b'\xc3\xa9'.decode('utf-8') == b'\x65\xcc\x81'.decode('utf-8')
False
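To make the difference concrete, here is a minimal sketch that lists the code points behind each byte sequence and then compares the two strings after Unicode normalization. The helper `describe` and the variable names are illustrative only, and using `unicodedata.normalize` with the NFC form is an assumption about one possible approach, not something I have confirmed covers every look-alike character.

```python
import unicodedata

# The two byte sequences from the comparison above, decoded as UTF-8.
precomposed = b'\xc3\xa9'.decode('utf-8')      # one code point
decomposed = b'\x65\xcc\x81'.decode('utf-8')   # two code points

def describe(s):
    """Return the code point and Unicode name of every character in s (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

print(describe(precomposed))
print(describe(decomposed))

# Assumption: normalizing both sides to the same form (NFC here) before comparing.
nfc_equal = (unicodedata.normalize('NFC', precomposed)
             == unicodedata.normalize('NFC', decomposed))
print(nfc_equal)
```

The first string is a single precomposed character, while the second is a plain `e` followed by a combining accent, which is why the direct `==` comparison fails even though both render the same.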
I think the same problem appears when you write a regex to match such characters. Normally, you don't see the encoded bytes in your text editor.
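A small sketch of the regex case, assuming the pattern is written with the precomposed character and the input arrives in decomposed form. The function `nfc_search` is a hypothetical helper, and normalizing the input to NFC before matching is an assumption about one way to handle it, not a complete solution.

```python
import re
import unicodedata

# Pattern written with the precomposed character U+00E9.
pattern = re.compile('\u00e9')

# Input that renders the same but uses 'e' + U+0301 (combining acute accent).
text_decomposed = 'cafe\u0301'

def nfc_search(pat, text):
    """Search after normalizing the text to NFC (hypothetical helper)."""
    return pat.search(unicodedata.normalize('NFC', text))

raw_match = pattern.search(text_decomposed)               # no match on the raw input
normalized_match = nfc_search(pattern, text_decomposed)   # matches after normalization

print(raw_match)
print(normalized_match is not None)
```

The pattern and the text look identical on screen, yet the raw search finds nothing because the underlying code points differ.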