Two characters seem identical but UTF-8 encodings are not identical

Question

I need to filter out some illegal strings such as "Password", but I found that someone bypassed my check. They entered a string that looks exactly like "Password" but does not compare equal to it. I checked its encoding and, for example, the "a" is 8e61, while a normal "a" is 61 (hex). My PHP files, the HTML meta Content-Type, and the MySQL encoding are all UTF-8.

How does this happen? Why are there visually identical characters with different codes? I want to know how I can filter these characters. I put the weird string here, please copy it for research: Password

For some reason, when I pasted the problematic "Password" here, it was displayed as the plain ASCII one.

I ran the PHP function bin2hex() on the problematic "Password" and got the following:

50c28e61c28e73c28e73c28e776fc28e72c28e64c28e

while a normal one is:

50617373776f7264

To make it simpler, the hexadecimal representation for "a" is:

c28e61

while the normal one is:

61

Welcome to Stack Overflow. Please read the [About] page soon. Welcome to the wonderful world of Unicode, too. There are a lot of characters with multiple representations. For a semi-exotic example, the Arabic digit one is encoded twice, once for western Arabic U+0660 and once for eastern Arabic U+06F0, but the symbol is the same; it is some of the other digits that differ. See [In Unicode, why are there two representations for the Arabic digits](http://stackoverflow.com/questions/1676460/). You'll have to decide whether you're going to treat U+8E61 the same as U+0061 _[...continued...]_ — Jonathan Leffler, Jul 16 '13 at 04:03
_[...continuation...]_ Hold on; U+8E61 is a Unified Han symbol. Which code page are you using? 0x8E61 is not valid UTF-8; the 0x8E is a continuation byte, and the 0x61 is LATIN SMALL LETTER A, which can't be followed by a continuation byte. You've not given all the information we need; what is the entire byte sequence you're dealing with? The comments above are still accurate and more or less relevant, but you are unlikely to be treating U+8E61 as if it was U+0061. — Jonathan Leffler, Jul 16 '13 at 04:09
I copied your string and it is identified as containing: `0x0000: 50 61 73 73 77 6F 72 64 Password`. That's the regular ASCII representation of Password. So either your copy/paste didn't preserve the odd characters, or mine didn't. I'm working on a Mac. Can you identify the bytes you think you have in hex? — Jonathan Leffler, Jul 16 '13 at 04:12
(Oops: U+0660 and U+06F0 are Arabic zeroes, not ones; U+0661 and U+06F1 are the ones.) — Jonathan Leffler, Jul 16 '13 at 04:31

Jonathan Leffler · Answer 1 · 2013-07-16T05:49:20.023

1

Given the hex string 50c28e61c28e73c28e73c28e776fc28e72c28e64c28e, you have an encoding of a valid UTF-8 string:

0x50      = U+0050 = P
0xC2 0x8E = U+008E = SS2
0x61      = U+0061 = a
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x77      = U+0077 = w
0x6F      = U+006F = o
0xC2 0x8E = U+008E = SS2
0x72      = U+0072 = r
0xC2 0x8E = U+008E = SS2
0x64      = U+0064 = d
0xC2 0x8E = U+008E = SS2

The 0xC2 0x8E sequence is the UTF-8 encoding of U+008E, which corresponds to byte 0x8E in ISO 8859-1: the control character SS2, or Single Shift 2 (see the Unicode Code Charts). SS2 has no defined visible representation, which is why the string looks like plain 'Password' on screen. The string is nevertheless clearly different from plain 'Password'. As long as you don't strip out control characters, a string comparison will not treat it as identical to plain 'Password', so you should be able to spot the difference.
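The byte-by-byte breakdown above can be reproduced with a short script. This is only a minimal sketch in Python (the question itself uses PHP, where bin2hex() produced the hex strings); the helper name `describe_hex` is made up for illustration.

```python
import unicodedata

def describe_hex(hex_string):
    """Decode a hex dump as UTF-8 and list each code point with its Unicode category."""
    text = bytes.fromhex(hex_string).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]

# Hex strings taken from the question.
suspect = "50c28e61c28e73c28e73c28e776fc28e72c28e64c28e"
plain = "50617373776f7264"

for code_point, category in describe_hex(suspect):
    # Category "Cc" marks a control character such as U+008E (SS2).
    print(code_point, category)

# The byte sequences differ, so a plain comparison already tells them apart.
print(bytes.fromhex(suspect) == bytes.fromhex(plain))
```

Every U+008E entry should show up with category Cc, while the letters show up as Lu or Ll.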

edited Jul 16 '13 at 05:49

answered Jul 16 '13 at 05:44

Jonathan Leffler


Thank you! How can I remove this character, or this kind of character, in PHP? I searched and found answers like this one: http://stackoverflow.com/questions/1176904/php-how-to-remove-all-non-printable-characters-in-a-string, but they can't remove this character. – Lucas Lee Jul 16 '13 at 06:37
I've found the solution to remove it here: http://stackoverflow.com/questions/3295125/preg-replace-to-strip-out-non-printing-characters-seems-to-remove-all-foreign-ch – Lucas Lee Jul 16 '13 at 06:44
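
One way to handle "this kind of character" is to drop everything in the Unicode control (Cc) and format (Cf) categories before comparing. The sketch below is in Python and is not the code from either linked answer; `strip_invisible`, `is_forbidden`, and the `banned` list are hypothetical names. In PHP a comparable approach would presumably be preg_replace() with the /u modifier and a property class such as \p{Cc}, but that is an assumption rather than something taken from the thread.

```python
import unicodedata

def strip_invisible(text):
    """Drop characters in Unicode categories Cc (control) and Cf (format).

    Note that Cc also covers tab and newline, so this is only suitable
    for short single-line values such as the one in the question.
    """
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )

def is_forbidden(text, banned=("password",)):
    # Hypothetical filter: strip invisible characters, then compare case-insensitively.
    return strip_invisible(text).casefold() in banned

# The byte sequence reported in the question, decoded as UTF-8.
suspect = bytes.fromhex("50c28e61c28e73c28e73c28e776fc28e72c28e64c28e").decode("utf-8")

print(repr(strip_invisible(suspect)))
print(is_forbidden(suspect))
```

Working on decoded code points rather than raw bytes matters here: a byte-range pattern that only covers 0x00-0x1F and 0x7F would not touch U+008E, which is encoded as the two bytes 0xC2 0x8E.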

score 0 · Answer 2 · edited May 23 '17 at 11:43

What you might be seeing (I can't tell exactly, because parts of your question don't make sense or are inconsistent) are so-called homoglyphs: characters that look identical or very similar and thus can be mistaken for one another at first glance. To circumvent your check, people can use a Cyrillic "a" (U+0430) in place of the Latin "a" and get away with it. But frankly, this isn't actually a problem, because I know of no password cracker that will actually try mixing scripts, as most passwords are ASCII-only.

As for the why, you can take a look at "Why are there duplicate characters in Unicode?".
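
To make the homoglyph case concrete, here is a small Python sketch, assuming the attacker swaps in CYRILLIC SMALL LETTER A (U+0430). The `scripts_used` helper is hypothetical and only a rough heuristic: it takes the first word of each letter's Unicode name as the script, which is not a real script-detection API.

```python
import unicodedata

latin = "Password"
# Same word with CYRILLIC SMALL LETTER A (U+0430) in place of the Latin "a".
spoofed = "P\u0430ssword"

def scripts_used(text):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

# The strings render almost identically but are not equal.
print(latin == spoofed)

# The UTF-8 bytes show where they differ.
print(latin.encode("utf-8").hex())
print(spoofed.encode("utf-8").hex())

# A string mixing more than one script is a hint that something is off.
print(scripts_used(latin))
print(scripts_used(spoofed))
```

Unlike the control-character case, nothing can simply be stripped here; a filter would have to either reject mixed-script input or map look-alike letters to a common form.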
