11

Does anyone know an easy way to find Unicode characters that look similar to ASCII characters? An example is "CYRILLIC SMALL LETTER DZE (ѕ)". I'd like to do a search and replace for such characters. By similar I mean visually similar to a human reader: you can't see a difference by looking at them.

Nietzche-jou
DrDol

2 Answers

16

As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)
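To make the point about normalisation concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `nfkc` is only for illustration. It shows that compatibility normalisation folds a ligature into ASCII letters but leaves the Cyrillic dze from the question untouched.

```python
import unicodedata

def nfkc(text):
    """Apply compatibility normalisation (NFKC) to a string."""
    return unicodedata.normalize("NFKC", text)

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, has a compatibility mapping
dze = "\u0455"       # CYRILLIC SMALL LETTER DZE, looks like ASCII "s"

# The ligature has an official equivalence, so it is folded to "fi".
print(nfkc(ligature) == "fi")

# The dze only *looks* like "s"; normalisation returns it unchanged.
print(nfkc(dze) == dze)
print(nfkc(dze) == "s")
```

So normalisation is useful for official equivalences, but a purely visual match needs a separate lookup table.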

To spare yourself the tedious work of assembling a list of characters by hand, I'd search for resources on homograph attacks. This is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also -- and that may be what you most need -- a "confusables" table. Here's another article, mainly on punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
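As a rough sketch of how such a table could drive a search and replace, the Python below parses a few lines laid out as `source ; target ; type # comment`, which is how I understand the confusables data file to be structured. The three sample lines, the function names and `TABLE` are illustrative assumptions, not the real file or an official API.

```python
# Hand-written sample in the assumed layout: source ; target ; type # comment
# These three lines are for illustration only, not the full table.
SAMPLE = """\
0455 ;\t0073 ;\tMA\t# CYRILLIC SMALL LETTER DZE -> LATIN SMALL LETTER S
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
"""

def parse_confusables(text):
    """Return {source_string: target_string} for every data line."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table

def to_ascii_lookalikes(text, table):
    """Replace each character whose listed target is pure ASCII."""
    out = []
    for ch in text:
        target = table.get(ch)
        out.append(target if target is not None and target.isascii() else ch)
    return "".join(out)

TABLE = parse_confusables(SAMPLE)

# "pa" + two dze + "w" + Cyrillic o + "rd" looks like "password".
print(to_ascii_lookalikes("pa\u0455\u0455w\u043erd", TABLE))
```

With the full table loaded instead of `SAMPLE`, the same lookup can flag or rewrite look-alike characters. Entries whose target is not ASCII are deliberately left alone here.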

What I do hope is that you aren't asking the question to construct such an attack.

Tschallacka
chryss
  • Thanks for all the good links and explanations. I actually try to protect against such attacks. :-) And I guess I will find some further stuff with the keyword "homograph attack". – DrDol Aug 04 '10 at 22:34
  • That is good to hear :) . Yeah, that's the keyword you need! I edited a link (it pointed to an obsolete version). – chryss Aug 04 '10 at 22:40
  • A legitimate use: for internationalization testing, I have a tool that generates fake foreign-language text using similar-looking characters. An English-speaking tester can read the "foreign" text, but can also clearly tell that it is not hard-coded English. It doesn't work, though, if the Unicode character is so similar that you can't tell the difference. I mainly do things like add accents to the vowels. – Kip Apr 02 '15 at 15:14
  • I'm using this to make an ircbot which does not highlight anyone if it mentions somebody in a channel :) – Christophe De Troyer Aug 28 '20 at 12:43
-2

See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.

Each line describes a Unicode character, for example:

1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;

If there are any similar (compatible) characters for that symbol, they will appear in the entry's decomposition field, marked <compat>. In this example, 0061 (ASCII a) is compatible with the LATIN SMALL LETTER A WITH RIGHT HALF RING Unicode character.

As for your character, the entry is

0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405

which, as you can see, does not specify a compatibility character.
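For anyone who wants to check this programmatically, here is a small Python sketch that splits the two entries quoted above on `;` and reads the sixth field, which holds the decomposition mapping. The function name `decomposition_of` is hypothetical. The standard `unicodedata.decomposition` call returns the same field from Python's bundled copy of the database.

```python
import unicodedata

# The two UnicodeData.txt entries quoted in this answer.
LINES = [
    "1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;",
    "0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405",
]

def decomposition_of(line):
    """Return (code_point, tag, mapping) parsed from one entry."""
    fields = line.split(";")
    code_point = int(fields[0], 16)
    parts = fields[5].split()  # sixth field: decomposition mapping
    tag = parts[0] if parts and parts[0].startswith("<") else None
    mapping = [int(cp, 16) for cp in parts if not cp.startswith("<")]
    return code_point, tag, mapping

for line in LINES:
    code_point, tag, mapping = decomposition_of(line)
    print(f"U+{code_point:04X}", tag, [f"U+{cp:04X}" for cp in mapping])

# The standard library exposes the same field directly.
print(repr(unicodedata.decomposition("\u1e9a")))
print(repr(unicodedata.decomposition("\u0455")))
```

An empty mapping, as for U+0455, means the database records no decomposition, so this route gives no ASCII counterpart for that character.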

adamk
  • The compatibility field describes a sequence of characters that'd mean the same thing as the character in question. In your example, the compatible sequence would be `U+0061` (the letter 'a') followed by `U+02BE` (the 'right half ring' modifier). For characters from different alphabets, it'd be pretty unusual for there to be compatibility sequences -- and that'd make what the OP is trying to do impossible without more info. – cHao Aug 04 '10 at 11:38
  • The OP stated 'similar to ASCII characters', not exact. If you're looking for an 'a' with a right half ring, you could settle for an ASCII 'a' if there's nothing else available. – adamk Aug 04 '10 at 12:10
  • Agreed -- in that case. But if you're looking for an ASCII char similar to a Cyrillic ѕ, which is the very example the OP used, that won't work. – cHao Aug 04 '10 at 12:35
  • @cHao: You're right - as I stated in my answer, for the specific character the OP requested, the compatibility characters aren't a useful method. – adamk Aug 04 '10 at 13:31