Allow only letters and digits in strings but without confusables

Question

Say I want usernames to only consist of letters and digits regardless of language.

I think I can accomplish this with the following regex parts:

(?>\p{L}[\p{Mn}\p{Mc}]*) //match any letter, including those consisting of two code points

\p{Nd} //match any digit
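
For comparison, the two regex parts above can be approximated outside of a regex engine. The following is a minimal Python sketch, assuming the standard `unicodedata` module is available (Python's built-in `re` module does not support `\p{...}` classes); the function name `is_letters_and_digits` is made up for illustration. It accepts a letter followed by any number of combining marks, or a decimal digit, and nothing else.

```python
import unicodedata


def is_letters_and_digits(name: str) -> bool:
    """Return True if name contains only letters (each optionally followed
    by combining marks) and decimal digits. Hypothetical helper."""
    if not name:
        return False
    after_letter = False  # True while we are inside a letter + marks cluster
    for ch in name:
        category = unicodedata.category(ch)
        if category.startswith("L"):
            # Any letter category (Lu, Ll, Lt, Lm, Lo), like \p{L}
            after_letter = True
        elif category in ("Mn", "Mc"):
            # Combining marks are only accepted right after a letter,
            # mirroring \p{L}[\p{Mn}\p{Mc}]*
            if not after_letter:
                return False
        elif category == "Nd":
            # Decimal digit, like \p{Nd}
            after_letter = False
        else:
            return False
    return True
```

Note that a check like this still accepts look-alike letters: `is_letters_and_digits("\uff41dmin")` (with a fullwidth "ａ") returns True, because U+FF41 is a letter.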

Now the problem is that users may impersonate other users by choosing a username that looks the same as someone else's (a homograph attack). admin vs. ａdmin would be an example.

I guess it's not possible to easily exclude characters that are both letters and confusables using a regex, but how about outside of regexes? Do the Unicode code points of confusables lie in certain ranges that we could filter, or something like that?

There are libraries for this functionality; they collect homographs in large tables and compile them into a single regex. — Bergi, Oct 04 '14 at 17:41

score 0 · Answer 1 · answered Oct 04 '14 at 17:57

Confusables... that suggests you are talking about Cyrillic characters. If that's right, you can easily exclude them in your regex. Consider the following ranges:

Cyrillic: U+0400–U+04FF, 256 characters

Cyrillic Supplement: U+0500–U+052F, 48 characters

Cyrillic Extended-A: U+2DE0–U+2DFF, 32 characters

Cyrillic Extended-B: U+A640–U+A69F, 96 characters

Phonetic Extensions: U+1D2B, U+1D78, 2 Cyrillic characters

Then:

/[^\x{0400}-\x{04FF}\x{0500}-\x{052F}\x{2DE0}-\x{2DFF}\x{A640}-\x{A69F}\x{1D2B}\x{1D78}]/u

Or, more simply, use [^\p{Cyrillic}] (equivalently, \P{Cyrillic}).
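
As a rough illustration of the range-based check, here is a minimal Python sketch that uses exactly the code point ranges listed in this answer. The names `CYRILLIC` and `contains_cyrillic` are made up for the example, and the built-in `re` module is assumed (it has no `\p{Cyrillic}`, so the ranges are spelled out).

```python
import re

# Character class built from the ranges listed above: Cyrillic, Cyrillic
# Supplement, Cyrillic Extended-A, Cyrillic Extended-B, and the two
# Cyrillic characters in Phonetic Extensions.
CYRILLIC = re.compile(
    "[\u0400-\u04FF\u0500-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1D2B\u1D78]"
)


def contains_cyrillic(name: str) -> bool:
    """Return True if any character of name falls in the listed ranges."""
    return CYRILLIC.search(name) is not None
```

With this, `contains_cyrillic("\u0430dmin")` (Cyrillic "а") is True, but the fullwidth example from the question, `contains_cyrillic("\uff41dmin")`, is False, so a Cyrillic-only filter does not catch it.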

revo

Are Cyrillic characters the only confusables inside the letter category? I fear there might be more confusable letters than just the Cyrillic ones. – user764754 Oct 04 '14 at 18:01
@user764754 Yes, Cyrillic characters are the ones most commonly used in homograph attacks. This approach excludes every character in that lovely set, although, as Wikipedia states, `it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts` – revo Oct 04 '14 at 18:09
_"usernames to only consist of letters and digits regardless of language"_ If I understand correctly, users should be able to register using their native character set. So blindly rejecting Cyrillic characters will arbitrarily prevent genuine usernames that use that character set. – Sylvain Leroux Oct 04 '14 at 18:13
@revo This is certainly helpful, but the fact that Cyrillic characters are the ones used most often doesn't make it secure when there are other characters an attacker could use. @Sylvain Leroux: Yes, but I think allowing confusables under certain conditions would result in great complexity. – user764754 Oct 04 '14 at 18:16
@user764754 The topic itself is arguable, but it's not practical because of the sheer number of comparisons needed across languages. If I were you, I'd pick one language as a base (_English maybe?!_) and try to collect the characters that are homographs of its letters. That way people could still have many usernames, even identical-looking ones, but that doesn't matter because all I care about is my base language, which I have made safe. – revo Oct 04 '14 at 18:52
This seems like unusual punishment for anybody who wants to use Cyrillic script (myself included, sometimes!). Simplest solution is to allow Latin names only - no confusion, no discrimination – mvp Oct 05 '14 at 06:50

score 0 · Answer 2 · answered Oct 07 '14 at 00:03

The Unicode standard includes a list of confusable characters at http://www.unicode.org/Public/security/revision-02/confusables.txt

This list is incomplete according to some, and too aggressive according to others, but take a look at it in order to understand how difficult the problem is to solve generically.
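
To give an idea of how such a list could be used, here is a minimal Python sketch. It assumes the file's data lines have the form `source ; target ; type # comment`, with code points written in hex; the two sample lines are only illustrative entries in that format, and `parse_confusables` and `skeleton` are hypothetical helper names. The comparison is a simplified version of the "skeleton" idea (map every character to its prototype, then compare), not a complete implementation of the Unicode security mechanisms.

```python
import unicodedata

# Illustrative lines in the assumed "source ; target ; type # comment" format.
SAMPLE_LINES = """\
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def parse_confusables(text: str) -> dict:
    """Build a {source character: prototype string} map from data lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(name: str, mapping: dict) -> str:
    """Simplified skeleton: decompose, map to prototypes, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    replaced = "".join(mapping.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", replaced)
```

Two usernames would then be treated as confusable when their skeletons are equal, for example `skeleton("\u0430dmin", m) == skeleton("admin", m)` with `m = parse_confusables(SAMPLE_LINES)`. A new registration could be rejected if its skeleton matches that of an existing username, which avoids banning a whole script outright.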
