Convert optically equivalent unicode strings to ASCII in Java?

Question

I run a social network that requires unicode usernames to be unique (as expected).

Some creative users have started using Cyrillic (and other) unicode characters to create optically equivalent (but unicode distinct) usernames.

For example, they'll use the Cyrillic small letter a 'а' (U+0430), which looks identical to the Latin 'a' (U+0061).

Does anyone know of a way to convert these optically equivalent characters automatically in Java? I'd rather not have to create a conversion table by hand if a mechanism already exists.

http://stackoverflow.com/questions/2096667/convert-unicode-to-ascii-without-changing-the-string-length-in-java/2097224#2097224 — user3020494, Nov 24 '13 at 02:08
The referenced answer doesn't solve the problem at hand. The first answer simply removes diacritical marks and converts the remaining non-ASCII characters to '?'s. The second answer regarding Normalizer.Form.NFD does not affect the Cyrillic letter 'a' at all. — OnesAndZeroes, Nov 24 '13 at 02:17

score 1 · Answer 1 · answered Nov 24 '13 at 02:13

You can try Unicode normalization - basically, code points that Unicode defines as equivalent are mapped to a single 'canonical' form, and normalization is the process of replacing each character with that form.

Java supports Unicode normalization via java.text.Normalizer (available since Java 6).

However, I'm not sure that Latin a and Cyrillic а are marked as equivalent in Unicode - you'd have to try. Normalization is about encoding equivalence, not about how glyphs look, so they may well be treated as distinct.
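A quick way to "try" is to normalize both strings and compare them. The sketch below uses only java.text.Normalizer from the JDK; the class and method names (NormalizerCheck, sameAfterNormalization) are made up for illustration. It uses NFKC, the most aggressive of the standard forms, and contrasts the Cyrillic letter with the fullwidth Latin letter U+FF41, which does have a compatibility mapping to plain 'a'.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of any library.
final class NormalizerCheck {

    // True if both strings are identical after NFKC normalization.
    static boolean sameAfterNormalization(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFKC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFKC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String latinA = "a";          // U+0061 LATIN SMALL LETTER A
        String cyrillicA = "\u0430";  // U+0430 CYRILLIC SMALL LETTER A
        String fullwidthA = "\uFF41"; // U+FF41 FULLWIDTH LATIN SMALL LETTER A

        // Cyrillic а has no decomposition, so normalization leaves it alone.
        System.out.println(sameAfterNormalization(latinA, cyrillicA));  // false
        // The fullwidth form is a compatibility variant of 'a'.
        System.out.println(sameAfterNormalization(latinA, fullwidthA)); // true
    }
}
```

So normalization is still worth applying before a uniqueness check, because it catches compatibility variants, but it does nothing for cross-script look-alikes.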

This will also not help you when your users start using very similar instead of identical characters - humans are very inventive and a technical solution might not work 100% here, so you will probably have to resort to human moderation anyway.

There are also some other solutions - limiting usernames to Latin alphanumerics, for example.
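As a rough illustration of the "Latin alphanumerics only" option, a whitelist check could look like the sketch below. The class name UsernamePolicy and the exact character set (ASCII letters and digits, no length limit, no punctuation) are assumptions chosen for the example, not a recommendation for any particular site.

```java
import java.util.regex.Pattern;

// Hypothetical policy class; the allowed character set is an assumption.
final class UsernamePolicy {

    // One or more ASCII letters or digits, nothing else.
    private static final Pattern ASCII_ALNUM = Pattern.compile("[A-Za-z0-9]+");

    static boolean isAllowed(String username) {
        return username != null && ASCII_ALNUM.matcher(username).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("alice42"));     // true
        System.out.println(isAllowed("\u0430lice42")); // false: starts with Cyrillic а
    }
}
```

The obvious trade-off is that this also rejects every legitimate non-Latin username, which may not be acceptable for a site that wants Unicode names.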

Yeah...I tried the Normalizer approach, and it looks like latin a and cyrillic a are not marked as equivalent. Looks like I may just have to build a conversion table by hand. Bummer. — OnesAndZeroes, Nov 24 '13 at 02:19

score 1 · Answer 2 · answered Nov 24 '13 at 02:49

Why don't you try applying an OCR library?

Andyz Smith

Yeah, one could even perform the OCR once, offline, and build up the desired translation tables, rather than having to do the OCR analysis on the fly. – Hot Licks Nov 24 '13 at 03:09
I considered writing something to compare the pixels between characters, but decided just to go through the Unicode tables by hand. The Cyrillic, Greek and Latin sets seemed to have the most offenders. It wasn't too bad in the end. – OnesAndZeroes Nov 24 '13 at 05:02
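For readers wondering what such a hand-built table might look like in Java, here is a minimal sketch. It is not the poster's actual table: the class name HomoglyphFolder, the method name skeleton, and the handful of mappings are illustrative assumptions, and a real table would need many more entries. The idea is to fold each username to a "skeleton" and enforce uniqueness on the skeleton rather than on the raw string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the table below is a tiny illustrative subset.
final class HomoglyphFolder {

    private static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x0430, "a"); // Cyrillic а
        TABLE.put(0x0435, "e"); // Cyrillic е
        TABLE.put(0x043E, "o"); // Cyrillic о
        TABLE.put(0x0440, "p"); // Cyrillic р
        TABLE.put(0x0441, "c"); // Cyrillic с
        TABLE.put(0x0443, "y"); // Cyrillic у
        TABLE.put(0x0445, "x"); // Cyrillic х
        TABLE.put(0x0456, "i"); // Cyrillic і
        TABLE.put(0x03BF, "o"); // Greek ο (omicron)
    }

    // Replaces every code point found in the table; others pass through unchanged.
    static String skeleton(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String replacement = TABLE.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixed Cyrillic/Latin spelling folds to the plain ASCII name.
        System.out.println(skeleton("\u0430dmin")); // admin
    }
}
```

Two usernames would then be treated as colliding when their skeletons are equal, while the original spelling can still be stored and displayed.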
