
I run a social network that requires Unicode usernames to be unique.

Some creative users have started using Cyrillic (and other) Unicode characters to create usernames that look identical to existing ones but are distinct in Unicode.

For example, they'll use the Cyrillic small letter a 'а' (U+0430), which looks identical to the Latin one (U+0061).

Does anyone know of a way to convert these visually identical characters automatically in Java? I'd rather not create a conversion table by hand if a mechanism already exists.

OnesAndZeroes
  • http://stackoverflow.com/questions/2096667/convert-unicode-to-ascii-without-changing-the-string-length-in-java/2097224#2097224 – user3020494 Nov 24 '13 at 02:08
  • This might depend on what font is used. Tough problem. – goat Nov 24 '13 at 02:12
  • The referenced answer doesn't solve the problem at hand. The first answer simply removes diacritical marks and converts the remaining non-ASCII characters to '?'s. The second answer regarding Normalizer.Form.NFD does not affect the Cyrillic letter 'a' at all. – OnesAndZeroes Nov 24 '13 at 02:17
  • http://www.unicode.org/reports/tr39/#Confusable_Detection – ninjalj May 12 '14 at 19:44

2 Answers


You can try Unicode normalization. Code points that Unicode defines as equivalent share a designated normalized form, and normalization is the process of replacing each character with that form.

Java supports Unicode normalization via java.text.Normalizer.

However, I'm not sure that Latin a and Cyrillic а are marked as equivalent in Unicode; you'd have to try.
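A minimal sketch of that check, assuming NFKC as the normalization form (the class and method names are made up for illustration). It compares a fullwidth Latin a, which Unicode maps to plain a under compatibility normalization, with the Cyrillic а, which has no such mapping:

```java
import java.text.Normalizer;

class NormalizerCheck {
    // Hypothetical helper: applies compatibility normalization (NFKC).
    static String normalize(String username) {
        return Normalizer.normalize(username, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String latin = "a";           // U+0061 LATIN SMALL LETTER A
        String cyrillic = "\u0430";   // U+0430 CYRILLIC SMALL LETTER A
        String fullwidth = "\uFF41";  // U+FF41 FULLWIDTH LATIN SMALL LETTER A

        // Fullwidth a has a compatibility mapping to a, so NFKC folds it.
        System.out.println(normalize(fullwidth).equals(latin)); // prints "true"
        // Cyrillic а has no decomposition, so it stays a distinct code point.
        System.out.println(normalize(cyrillic).equals(latin));  // prints "false"
    }
}
```

So normalization is still worth applying before the uniqueness check, but on its own it does not catch cross-script lookalikes.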

This also won't help once your users start using characters that are merely very similar rather than identical. Humans are inventive, and a technical solution may not catch everything, so you will probably have to fall back on human moderation anyway.

There are also other solutions, such as limiting usernames to Latin alphanumerics.
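As a rough sketch of the restriction approach, assuming "Latin alphanumerics" means ASCII letters and digits only (the class name and the exact pattern are illustrative, not a recommendation for any particular site):

```java
import java.util.regex.Pattern;

class UsernamePolicy {
    // Assumption: only ASCII letters and digits are allowed, at least one character.
    private static final Pattern ASCII_ALNUM = Pattern.compile("[A-Za-z0-9]+");

    // Returns true only if every character is an ASCII letter or digit.
    static boolean isAllowed(String username) {
        return ASCII_ALNUM.matcher(username).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("admin"));       // prints "true"
        System.out.println(isAllowed("\u0430dmin"));  // prints "false" (starts with Cyrillic а)
    }
}
```

The trade-off is that this also rejects every legitimate non-Latin username, which may not be acceptable for a network that wants Unicode names.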

Jakub Wasilewski
  • Yeah... I tried the Normalizer approach, and it looks like Latin a and Cyrillic а are not marked as equivalent. Looks like I may just have to build a conversion table by hand. Bummer. – OnesAndZeroes Nov 24 '13 at 02:19
  • @OnesAndZeroes Did you expect that they would be? – Andyz Smith Nov 24 '13 at 02:48

Why don't you try applying an OCR library?

Andyz Smith
  • Yeah, one could even statically perform the OCR and build up the desired translation tables, vs having to do the OCR analysis on the fly. – Hot Licks Nov 24 '13 at 03:09
  • I considered writing something to compare the pixels between characters, but decided just to go through the Unicode tables by hand. The Cyrillic, Greek and Latin sets seemed to have the most offenders. It wasn't too bad in the end. – OnesAndZeroes Nov 24 '13 at 05:02
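For anyone taking the same hand-built route, here is a minimal sketch of what such a table might look like. The class name, the method name, and the handful of entries are illustrative assumptions, not the table the asker built; a real table would be much larger, and the Unicode TR39 report linked in the comments publishes confusables data that could seed it.

```java
import java.util.HashMap;
import java.util.Map;

class ConfusableFolder {
    // Illustrative subset only: a few Cyrillic and Greek lookalikes of Latin letters.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0430', 'a'); // CYRILLIC SMALL LETTER A
        TABLE.put('\u0435', 'e'); // CYRILLIC SMALL LETTER IE
        TABLE.put('\u043E', 'o'); // CYRILLIC SMALL LETTER O
        TABLE.put('\u0440', 'p'); // CYRILLIC SMALL LETTER ER
        TABLE.put('\u0441', 'c'); // CYRILLIC SMALL LETTER ES
        TABLE.put('\u0445', 'x'); // CYRILLIC SMALL LETTER HA
        TABLE.put('\u03BF', 'o'); // GREEK SMALL LETTER OMICRON
    }

    // Replaces each listed lookalike with its Latin counterpart; other characters pass through.
    static String fold(String username) {
        StringBuilder sb = new StringBuilder(username.length());
        for (int i = 0; i < username.length(); i++) {
            char c = username.charAt(i);
            char mapped = TABLE.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "аdmin" with a Cyrillic first letter folds to plain "admin".
        System.out.println(fold("\u0430dmin").equals("admin")); // prints "true"
    }
}
```

One way to use it is to store the folded form alongside each username and enforce uniqueness on that folded form, so the displayed name keeps its original characters.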