How to compare the same character with multiple code points?

Question

Some characters look almost the same but have different code points. How do I know which one the user has typed? For example, if I want to check whether the user has entered 'é', should I test it against c3 a9 or 65 cc 81? If I check both, how do I know there aren't other possibilities?

What I get in Python:

>>> b'\xc3\xa9'.decode('utf-8') == b'\x65\xcc\x81'.decode('utf-8')
False

I think the same problem appears when you write a regex to match such characters. Normally, you don't see the encoded bytes in your text editor.

But you're using *the same* encoding against those two characters, not two different encodings. — Robert Harvey, Jan 03 '19 at 19:41
@RobertHarvey I might be bad at phrasing. But the same user-perceived character has different code points, and I don't know which one to test for. That's the problem. — Cyker, Jan 03 '19 at 19:42
Looks like it's [0xC3 0xA9](http://www.fileformat.info/info/unicode/char/e9/index.htm). Does some `é` exist other than that one? — Robert Harvey, Jan 03 '19 at 19:47
Decomposition: LATIN SMALL LETTER E (U+0065) COMBINING ACUTE ACCENT (U+0301) — Robert Harvey, Jan 03 '19 at 19:48
@RobertHarvey Yes, but they look the same. In fact, if you remove the `==` and print each one, you can't tell them apart. — Cyker, Jan 03 '19 at 19:48
Do you have any evidence that you'll ever see anything from the user other than 0xC3 0xA9? — Robert Harvey, Jan 03 '19 at 19:49
@RobertHarvey Not yet, but unless you can assure me that all of them will enter `c3 a9` with whatever input method they have, this is a potential bug. — Cyker, Jan 03 '19 at 19:50
I don't fix "potential bugs," though I have been known to employ defensive coding techniques from time to time. This one just feels like overthinking; unless you can find a way to test it, I'd go with what the definition says it is until a problem turns up. — Robert Harvey, Jan 03 '19 at 19:51
@RobertHarvey Then they come true. In fact, I don't see why you've decided `c3 a9` is more widely used than the other. Where did you get those statistics? — Cyker, Jan 03 '19 at 19:52
That's the UTF definition for `é`. It's in [the page I linked above](http://www.fileformat.info/info/unicode/char/e9/index.htm). — Robert Harvey, Jan 03 '19 at 20:00
What you're looking for here is [unicode normalization](http://unicode.org/reports/tr15/). — user3942918, Jan 03 '19 at 20:16
@RobertHarvey I think that's not a definition for `é` but for `U+00E9`. — Cyker, Jan 03 '19 at 21:19
@PaulCrovella That looks like the solution. Do you plan to write a short answer so that I can accept it? — Cyker, Jan 03 '19 at 21:21
The answer I'd want to write on this would be rather long. If you've got a short one that's more than just a link please have at it. — user3942918, Jan 03 '19 at 21:29
@PaulCrovella Following your keywords I found this great [answer](https://stackoverflow.com/questions/16467479/normalizing-unicode). — Cyker, Jan 03 '19 at 21:31
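
For readers who land here, the normalization approach suggested in the comments can be sketched as follows. This is a minimal illustration, not a complete answer: `same_text` is a hypothetical helper name, and choosing NFC as the default form is an assumption (NFD would work equally well, as long as both sides use the same form). The regex part assumes the pattern contains no regex metacharacters.

```python
import re
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same Unicode form.

    `same_text` is a hypothetical helper; `form` is passed straight to
    unicodedata.normalize ("NFC", "NFD", "NFKC" or "NFKD").
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings from the question.
precomposed = b"\xc3\xa9".decode("utf-8")     # U+00E9
decomposed = b"\x65\xcc\x81".decode("utf-8")  # U+0065 U+0301

print(precomposed == decomposed)              # False: different code points
print(same_text(precomposed, decomposed))     # True: equal after normalization

# The same idea for a regex: normalize both the pattern and the input
# to one form before matching.
pattern = re.compile(unicodedata.normalize("NFC", "caf\u00e9"))
text = "cafe\u0301"  # decomposed spelling of the same word
print(pattern.search(unicodedata.normalize("NFC", text)) is not None)  # True
```

Normalization only unifies sequences that Unicode defines as canonically equivalent (or compatibility equivalent, with the NFKC and NFKD forms). It does not unify look-alike characters from different scripts, such as Latin `e` and Cyrillic `е`.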
