
Some characters look almost the same but are made of different code points. How do I know which one the user has typed? For example, if I want to check whether the user has entered 'é', should I test it against c3 a9 or 65 cc 81? If I check both, how do I know there aren't other possibilities?

What I get in Python:

>>> b'\xc3\xa9'.decode('utf-8') == b'\x65\xcc\x81'.decode('utf-8')
False

I think the same problem appears when you write a regex to match such characters. Normally, you don't see the encoded bytes in your text editor.

Cyker
  • But you're using *the same* encoding against those two characters, not two different encodings. – Robert Harvey Jan 03 '19 at 19:41
  • @RobertHarvey I might be bad at phrasing. But the same user-perceived character has different code points, and I don't know which one to test for. That's the problem. – Cyker Jan 03 '19 at 19:42
  • Looks like it's [0xC3 0xA9](http://www.fileformat.info/info/unicode/char/e9/index.htm). Does some `é` exist other than that one? – Robert Harvey Jan 03 '19 at 19:47
  • @RobertHarvey I just wrote another in the question? – Cyker Jan 03 '19 at 19:47
  • Decomposition: LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301) – Robert Harvey Jan 03 '19 at 19:48
  • @RobertHarvey Yes but they look the same. In fact if you remove the `==` and print each, then you can't tell their difference. – Cyker Jan 03 '19 at 19:48
  • Do you have any evidence that you'll ever see anything from the user other than 0xC3 0xA9? – Robert Harvey Jan 03 '19 at 19:49
  • @RobertHarvey Not yet, but unless you can assure me that all of them will be entering `c3 a9` with whatever input method they have, this is a potential bug. – Cyker Jan 03 '19 at 19:50
  • I don't fix "potential bugs," though I have been known to employ defensive coding techniques from time to time. This one just feels like overthinking; unless you can find a way to test it, I'd go with what the definition says it is until a problem turns up. – Robert Harvey Jan 03 '19 at 19:51
  • @RobertHarvey Then they come true. In fact, I don't see why you decided `c3 a9` is more widely used than the other. Where did you get those statistics? – Cyker Jan 03 '19 at 19:52
  • That's the UTF definition for `é`. It's in [the page I linked above](http://www.fileformat.info/info/unicode/char/e9/index.htm). – Robert Harvey Jan 03 '19 at 20:00
  • What you're looking for here is [unicode normalization](http://unicode.org/reports/tr15/). – user3942918 Jan 03 '19 at 20:16
  • @RobertHarvey I think that's not a definition for `é` but for `U+00E9`. – Cyker Jan 03 '19 at 21:19
  • @PaulCrovella That looks like the solution. Do you plan to write a short answer so that I can accept it? – Cyker Jan 03 '19 at 21:21
  • The answer I'd want to write on this would be rather long. If you've got a short one that's more than just a link please have at it. – user3942918 Jan 03 '19 at 21:29
  • @PaulCrovella Following your keywords I found this great [answer](https://stackoverflow.com/questions/16467479/normalizing-unicode). – Cyker Jan 03 '19 at 21:31
  • Cool. Glad I could help. – user3942918 Jan 03 '19 at 21:33
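
For reference, here is a minimal sketch of the normalization approach suggested in the comments, assuming Python 3 and its standard `unicodedata` module. The helper names `same_text` and `search_normalized` are made up for illustration, and NFC is just one reasonable choice of form. The idea is to normalize both sides to the same form before comparing or regex matching, rather than enumerating every possible byte sequence.

```python
import re
import unicodedata

precomposed = b'\xc3\xa9'.decode('utf-8')     # U+00E9
decomposed = b'\x65\xcc\x81'.decode('utf-8')  # U+0065 U+0301


def same_text(a, b, form='NFC'):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def search_normalized(pattern, text, form='NFC'):
    """Hypothetical helper: normalize pattern and text before regex search."""
    return re.search(unicodedata.normalize(form, pattern),
                     unicodedata.normalize(form, text))


print(precomposed == decomposed)             # False
print(same_text(precomposed, decomposed))    # True
print(search_normalized('caf' + precomposed,
                        'caf' + decomposed) is not None)  # True
```

Note that this only covers canonically equivalent sequences. Look-alike characters from other scripts (for example, a Cyrillic letter that resembles a Latin one) are not unified by NFC or NFD and would need separate handling.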
