I want to take all symbols (from all alphabets) that look almost the same (e.g. ð, ô, ö, õ, ø) and replace them with the closest ASCII character, so that ð, ô, ö, õ, ø -> o.
This doesn't have to be transliteration, as in this library https://github.com/gcardone/junidecode
(we should not translate a symbol to the ASCII character closest in meaning, e.g. Ĉ -> s, but should find the ASCII symbol closest in appearance, e.g. Ĉ -> C).

- "Is it possible?" Surely. Whether it is possible without mapping each Unicode character explicitly to its ASCII counterpart is a different question. I would, however, suspect that this is not possible without greater programming effort. – Turing85 Oct 30 '17 at 12:03
- @Turing85, I tried to map every symbol, but it will take a long time to find each corresponding one. So I thought that maybe there's a simpler solution. – Twinkle_Monkey Oct 30 '17 at 12:05
- 'Similar' sounds like an ML effort, unless you have a homomorphism already defined; then it should be easy. – ergonaut Oct 30 '17 at 12:32
- @ergonaut, I don't understand how a homomorphism fits here. Give me a hint, please. – Twinkle_Monkey Oct 30 '17 at 12:49
- The suggested duplicate will work on regular Latin characters but not specialized ones like Fraktur (𝔭 is not going to become p), fullwidth, …. – Tom Blodget Oct 30 '17 at 16:27
- Did you really mean to map LATIN SMALL LETTER ETH to LATIN SMALL LETTER O? If so, it's going to be really hard to build your projection. – Tom Blodget Oct 30 '17 at 16:47
- https://stackoverflow.com/questions/1453171/remove-diacritical-marks-%C5%84-%C7%B9-%C5%88-%C3%B1-%E1%B9%85-%C5%86-%E1%B9%87-%E1%B9%8B-%E1%B9%89-%CC%88-%C9%B2-%C6%9E-%E1%B6%87-%C9%B3-%C8%B5-from-unicode-chars Doesn't fit my question, because I want to replace characters by their appearance, not just remove diacritics or strip letters. – Twinkle_Monkey Oct 30 '17 at 18:22
- `Normalizer.Form.NFD` decomposes applicable codepoints into a base letter and combining codepoint(s). The rest of the code clears out all combining codepoints and leaves all other codepoints, including the base letters that were substituted in. – Tom Blodget Oct 30 '17 at 23:34
- @Twinkle_Monkey It's unclear if you already have a lookup function. Otherwise how do you know if you are correct? Is the Yen symbol going to turn into a Y? That's pretty much like saying $ turns into S. – ergonaut Oct 31 '17 at 01:03
2 Answers
I don't think there is any simple solution to this problem, because the symbols you want to group aren't really a group. The symbols Ò, Ó, Õ, Ö, Ø, and Ô are all sort of "O-like" in shape, and do have similar code points (0xD2-0xD8). In some languages they may even have somewhat similar pronunciation, although that can't be guaranteed. A case in point is the letter 'eth,' ð, which looks a bit like "o" but is not pronounced in a remotely similar way in any language (that I know of) where it is used. You've already recognized that the "ç" in French is more likely to be related in pronunciation to an "s" than to the "c" its shape resembles.
I think if you want to undertake this task, you will have to do it by case-by-case code point conversion (ugh!). However, I think the harder problem will not be in programming at all -- it will be finding mappings that actually make sense to a reader, given that there is little connection between symbol shape and linguistic role. The archetypal error of this kind is to render the Spanish "año" (year) as "ano" (which means "anus"). You really don't want to be making errors of this kind.
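A case-by-case conversion would amount to a hand-built lookup table. A minimal sketch (the class name ShapeMap and the handful of entries are illustrative only; a real table would need thousands of code points, each chosen by visual inspection):

```java
import java.util.HashMap;
import java.util.Map;

public class ShapeMap {
    // Hand-built shape-based mapping. These few entries are examples taken
    // from the question; a production table would be far larger.
    private static final Map<Character, Character> SHAPE = new HashMap<>();
    static {
        SHAPE.put('ð', 'o'); // by shape only, as the question requests
        SHAPE.put('ø', 'o');
        SHAPE.put('Ĉ', 'C');
        SHAPE.put('ł', 'l');
    }

    public static String toAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            // Substitute a mapped ASCII look-alike; pass other chars through.
            sb.append(SHAPE.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String... argv) {
        System.out.println(toAscii("ðøĈł")); // prints "ooCl"
    }
}
```

Combining this with the Normalizer approach from the other answer (decompose first, then consult the table for what remains) would keep the table smaller, since accented base letters are handled automatically.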

You can remove combining characters, but not all of your examples use them. For example, ð (eth) is a letter in its own right, not a "d" with a slash. Same with the Polish "dark l", ł.
import java.text.Normalizer;

public class RemoveMarks {
    public static void main(String... argv) {
        String src = "ðôöõøĈł";
        String dst = Normalizer.normalize(src, Normalizer.Form.NFKD);
        System.out.println(dst.replaceAll("\\p{Mn}+", ""));
    }
}
This should print "ðoooøCł". You can see that the real letters "o" have had their combining characters removed, as has the "Ĉ", which becomes "C".
This prompts the question, however: why would you want to do this? Why would you want to destroy information in a way that doesn't make sense orthographically?
If you are trying to match or search or index text, you should use a Collator
configured properly for the desired locale. This will automatically ignore differences that a user in that locale doesn't care about. For example, in American English, "Naïve" is identical to "naive", and "résumé" is just a stuffy way to spell "RESUME". A collator can take care of matching those variations.
import java.text.*;
import java.util.*;

public class CollatorDemo {
    public static void main(String... argv) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(Collator.PRIMARY);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        Map<CollationKey, String> map = new HashMap<>();
        map.put(collator.getCollationKey("resume"), "resume");
        map.put(collator.getCollationKey("naive"), "naive");
        System.out.println(map.get(collator.getCollationKey("RéSuMé"))); // resume
        System.out.println(map.get(collator.getCollationKey("NAÏVE"))); // naive
    }
}
