
I need to filter out some forbidden strings such as "Password", but someone has bypassed my check. They submitted a string that looks exactly like "Password" yet does not compare equal to it. When I inspected its bytes, the "a", for example, came out as 8e61, whereas a normal "a" is 61 (hex). My PHP files' encoding, the HTML meta Content-Type, and the MySQL encoding are all UTF-8.

How does this happen? Why are there visually identical characters with different codes? How can I filter out these characters? I have pasted the odd string here so that you can copy it for research: Password


For some reason, when I pasted the problematic "Password" here, it was displayed as the plain ASCII version.

I ran the PHP function bin2hex() on the problematic "Password" and got the following:

50c28e61c28e73c28e73c28e776fc28e72c28e64c28e

while a normal one is:

50617373776f7264

To make it simpler, the hexadecimal representation of the problematic "a" is:

c28e61

while the normal one is:

61
Jonathan Leffler
Lucas Lee
  • Welcome to Stack Overflow. Please read the [About] page soon. Welcome to the wonderful world of Unicode, too. There are a lot of characters with multiple representations. For a semi-exotic example, the Arabic digit one is encoded twice, once for western Arabic U+0660 and once for eastern Arabic U+06F0, but the symbol is the same; it is some of the other digits that differ. See [In Unicode, why are there two representations for the Arabic digits](http://stackoverflow.com/questions/1676460/). You'll have to decide whether you're going to treat U+8E61 the same as U+0061 _[...continued...]_ – Jonathan Leffler Jul 16 '13 at 04:03
  • _[...continuation...]_ Hold on; U+8E61 is a Unified Han symbol. Which code page are you using? 0x8E61 is not valid UTF-8; the 0x8E is a continuation byte, and the 0x61 is LATIN SMALL LETTER A, which can't be followed by a continuation byte. You've not given all the information we need; what is the entire byte sequence you're dealing with? The comments above are still accurate and more or less relevant, but you are unlikely to be treating U+8E61 as if it was U+0061. – Jonathan Leffler Jul 16 '13 at 04:09
  • I copied your string and it is identified as containing: `0x0000: 50 61 73 73 77 6F 72 64 Password`. That's the regular ASCII representation of Password. So either your copy/paste didn't preserve the odd characters, or mine didn't. I'm working on a Mac. Can you identify the bytes you think you have in hex? – Jonathan Leffler Jul 16 '13 at 04:12
  • (Oops: U+0660 and U+06F0 are Arabic zeroes, not ones; U+0661 and U+06F1 are the ones.) – Jonathan Leffler Jul 16 '13 at 04:31
  • @Jonathan Leffler, hex string provided, thanks – Lucas Lee Jul 16 '13 at 05:31

2 Answers


Given the hex string 50c28e61c28e73c28e73c28e776fc28e72c28e64c28e, you have valid UTF-8 that decodes to the following sequence of characters:

0x50      = U+0050 = P
0xC2 0x8E = U+008E = SS2
0x61      = U+0061 = a
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x77      = U+0077 = w
0x6F      = U+006F = o
0xC2 0x8E = U+008E = SS2
0x72      = U+0072 = r
0xC2 0x8E = U+008E = SS2
0x64      = U+0064 = d
0xC2 0x8E = U+008E = SS2
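
The breakdown above can be reproduced with a short PHP sketch. It assumes PHP 7 or later; `codePointOf()` is a hypothetical helper written for this illustration (it is not a built-in) and it assumes its argument is a single, valid UTF-8 character.

```php
<?php
// Hex dump from the question, as produced by bin2hex().
$hex = '50c28e61c28e73c28e73c28e776fc28e72c28e64c28e';
$bytes = hex2bin($hex);

// Hypothetical helper: decode one valid UTF-8 character to its code point.
function codePointOf(string $ch): int
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:
            return $b[0];
        case 2:
            return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:
            return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default:
            return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// The /u flag makes PCRE treat the subject as UTF-8, so an empty pattern
// splits between characters rather than between bytes.
$chars = preg_split('//u', $bytes, -1, PREG_SPLIT_NO_EMPTY);
if ($chars === false) {
    throw new RuntimeException('Input is not valid UTF-8');
}

$codePoints = [];
foreach ($chars as $ch) {
    $codePoints[] = sprintf('U+%04X', codePointOf($ch));
}

echo implode(' ', $codePoints), PHP_EOL;
```

The printed list contains 15 code points for a string that displays as 8 letters; the extra seven are all U+008E.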

The 0xC2 0x8E sequence is the UTF-8 encoding of U+008E, which corresponds to byte 0x8E in ISO 8859-1: the control character SS2, or Single Shift Two (see the Unicode Code Charts). SS2 has no defined visible representation, which is why the string looks like plain 'Password' even though it is clearly a different string. As long as you don't strip out control characters before comparing, a string comparison will not treat it as identical to plain 'Password', so you can detect the difference.
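
If the goal is for the filter to catch such input, one option is to remove control characters before comparing. The following is a minimal sketch, not a complete filter: `stripControls()` is a hypothetical helper name, and it assumes PHP 7 or later with input that is supposed to be UTF-8.

```php
<?php
// Hypothetical helper: remove control characters before running a filter.
function stripControls(string $s): string
{
    // \p{Cc} matches the Unicode "control" category: U+0000-U+001F,
    // U+007F and U+0080-U+009F, which includes U+008E (SS2).
    // Note that this also removes tab, CR and LF.
    $clean = preg_replace('/\p{Cc}/u', '', $s);

    // With the /u flag, preg_replace() returns null for invalid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

$suspicious = hex2bin('50c28e61c28e73c28e73c28e776fc28e72c28e64c28e');

var_dump($suspicious === 'Password');                // bool(false)
var_dump(stripControls($suspicious) === 'Password'); // bool(true)
```

This only covers control characters. Other invisible characters, such as those in the format category (`\p{Cf}`), would need their own decision, since some of them are meaningful in certain scripts.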

Jonathan Leffler
  • Thank you! How can I remove this character, or this kind of character, in PHP? I searched and found answers such as http://stackoverflow.com/questions/1176904/php-how-to-remove-all-non-printable-characters-in-a-string, but they don't remove this character. – Lucas Lee Jul 16 '13 at 06:37
  • I've found the solution to remove it here: http://stackoverflow.com/questions/3295125/preg-replace-to-strip-out-non-printing-characters-seems-to-remove-all-foreign-ch – Lucas Lee Jul 16 '13 at 06:44

What you might be seeing (I can't tell exactly, because parts of your question don't make sense or are inconsistent) are so-called homoglyphs: characters that look identical or very similar and can therefore be mistaken for one another at first glance. To circumvent your check, people can use a Cyrillic a and get away with it. Frankly, though, this isn't actually a problem, because I know of no password cracker that will try mixing scripts, as most passwords are ASCII-only.
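
To illustrate the homoglyph case, here is a rough sketch of a check that flags strings mixing Latin letters with Cyrillic or Greek ones. `mixesScripts()` is a hypothetical helper, the choice of scripts is an assumption made for this example, and the code assumes PHP 7 or later. It is a heuristic only and does not cover every confusable character.

```php
<?php
// Hypothetical helper: true if the string contains Latin letters together
// with Cyrillic or Greek letters. Heuristic only; many other confusable
// characters exist outside these scripts.
function mixesScripts(string $s): bool
{
    $hasLatin = preg_match('/\p{Latin}/u', $s) === 1;
    $hasOther = preg_match('/[\p{Cyrillic}\p{Greek}]/u', $s) === 1;
    return $hasLatin && $hasOther;
}

// U+0430 CYRILLIC SMALL LETTER A in place of the Latin "a".
$spoofed = "P\u{0430}ssword";

var_dump($spoofed === 'Password');  // bool(false)
var_dump(mixesScripts($spoofed));   // bool(true)
var_dump(mixesScripts('Password')); // bool(false)
```

Whether rejecting mixed-script input is acceptable depends on the application, since legitimate text can combine scripts.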

As for the why, take a look at "Why are there duplicate characters in Unicode?"

Joey