Given a unicode string of names like

"Guns N’ Roses, 2 × 4, Rust in Peace… Polaris, Black No. 1 (Little Miss Scare‐All), À Tout Le Monde"

where each name contains some non-ASCII character ('’', '×', '…', '‐', 'À'), I am looking for an algorithm that will simplify it to

"Guns N' Roses, 2 x 4, Rust in Peace... Polaris, Black No. 1 (Little Miss Scare-All), A Tout Le Monde"

where each non-ASCII character has been replaced by an ASCII substitute.

I know I can handle a whole class of characters (including 'À') by doing

Normalizer.normalize(value, Form.NFD).replaceAll("[^\\p{ASCII}]", "");

and of course, I could trivially handle any other character by using a number of hand-crafted .replaceAll(). But I wonder whether there is some standard algorithm that does not require enumerating all remaining characters I want to substitute. Is there even a name for what I want to do?
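For reference, the combination described above might look roughly like the following. This is only a sketch: the class name `AsciiFallback`, the method `toAscii`, and the contents of the replacement table are made up for illustration, and the table covers just the characters from the example string, which is exactly the enumeration I would like to avoid.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class AsciiFallback {
    // Hypothetical hand-crafted table; it only covers the example string.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u2019", "'");   // U+2019 right single quotation mark
        REPLACEMENTS.put("\u00D7", "x");   // U+00D7 multiplication sign
        REPLACEMENTS.put("\u2026", "..."); // U+2026 horizontal ellipsis
        REPLACEMENTS.put("\u2010", "-");   // U+2010 hyphen
    }

    public static String toAscii(String value) {
        String result = value;
        // Step 1: explicit substitutions for characters that NFD does not decompose.
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        // Step 2: decompose accented letters into base letter + combining mark.
        result = Normalizer.normalize(result, Normalizer.Form.NFD);
        // Step 3: drop everything that is still not ASCII (the combining marks,
        // but also any character missing from the table above).
        return result.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii(
            "Guns N\u2019 Roses, 2 \u00D7 4, Rust in Peace\u2026 Polaris, "
            + "Black No. 1 (Little Miss Scare\u2010All), \u00C0 Tout Le Monde"));
    }
}
```

Note that step 3 silently deletes any non-ASCII character that has no entry in the table, so the order of the steps matters.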

Marco Eckstein
  • 1
    While normalization decomposition will allow you to convert `À` to `A`, there is no algorithm that will do the other conversions. You could try something like parsing [NamesList.txt](https://www.unicode.org/Public/UNIDATA/NamesList.txt) and for any non-ASCII character, use its first cross reference which happens to be an ASCII character as a replacement, but [the parsing rules are more complicated than a glance suggests](http://www.unicode.org/Public/UNIDATA/NamesList.html), and you still wouldn’t have a way to convert the multiplication sign (`×`) to a lowercase `x`. – VGR Mar 24 '19 at 22:41
  • 1
    It's called transliteration. It can be locale and word dependent. Å could become A or Aa. – Tom Blodget Mar 24 '19 at 22:43
  • 2
    StringUtils#stripAccents() might help here. – ck1 Mar 24 '19 at 23:01
  • 1
    And Ä should become AE if your target language is German. That's the problem with fallback characters; there is no fixed conversion table. – Mr Lister Mar 25 '19 at 18:46

1 Answer

If you want a general-purpose solution, StringUtils.stripAccents (from Apache Commons Lang) is the way to go. Note, however, that accented letters will not become digraphs (such as oe or ae). Also, some characters that are not in ASCII but are not accented letters either, such as the German ß, have to be handled one by one afterwards, preferably by chaining the built-in String methods replace() or replaceAll().

Possible duplicate of "Is there a way to get rid of accents and convert a whole string to regular letters?"

Example:

żółtość wszędzie, łatwo wątpić w zieloność ówczesnego świata (Polish); école publique et laïque a fait de la orthographe strictement normalisée, sinon sa principale règle (French); eine große Online-Umfrage in  mittleren Großstädten zeigt, wo Fußgänger und ÖPNV-Nutzer zufrieden sind (German)

results in

zołtosc wszedzie, łatwo watpic w zielonosc owczesnego swiata (Polish); ecole publique et laique a fait de la orthographe strictement normalisee, sinon sa principale regle (French); eine große Online-Umfrage in  mittleren Großstadten zeigt, wo Fußganger und OPNV-Nutzer zufrieden sind (German)
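As the output shows, ł and ß survive the accent stripping. A minimal sketch of the follow-up step could look like the code below. It is an assumption-laden illustration, not the library's implementation: to stay free of dependencies it approximates stripAccents with the JDK's Normalizer, and the class name `AccentCleanup`, the method names, and the chosen substitutions (ß to ss, ł to l) are hypothetical choices.

```java
import java.text.Normalizer;

public class AccentCleanup {
    // JDK-only approximation of accent stripping (assumption: used here
    // instead of StringUtils.stripAccents to keep the example self-contained).
    static String stripAccentsApprox(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining diacritical marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Chained replace() calls for characters that have no decomposition.
    // The targets are illustrative; pick whatever fits your language.
    static String simplify(String input) {
        return stripAccentsApprox(input)
                .replace("\u00DF", "ss")  // U+00DF, German sharp s
                .replace("\u0142", "l")   // U+0142, Polish l with stroke
                .replace("\u0141", "L");  // U+0141, capital L with stroke
    }

    public static void main(String[] args) {
        // The Polish word from the example above, written with escapes.
        System.out.println(simplify("\u017C\u00F3\u0142to\u015B\u0107"));
        // A German word from the example above.
        System.out.println(simplify("Fu\u00DFg\u00E4nger"));
    }
}
```

With these substitutions the two sample words come out as zoltosc and Fussganger. Whether ß should become ss, or ä should become ae rather than a, depends on the target language, so the chain has to be chosen per use case.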