Given a Unicode string of names like
"Guns N’ Roses, 2 × 4, Rust in Peace… Polaris, Black No. 1 (Little Miss Scare‐All), À Tout Le Monde"
where each name contains some non-ASCII character ('’', '×', '…', '‐', 'À'), I am looking for an algorithm that will simplify it to
"Guns N' Roses, 2 x 4, Rust in Peace... Polaris, Black No. 1 (Little Miss Scare-All), A Tout Le Monde"
where each non-ASCII character has been replaced by an ASCII substitute.
I know I can handle a whole class of characters (including 'À') by doing
Normalizer.normalize(value, Form.NFD).replaceAll("[^\\p{ASCII}]", "");
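To make the limits of that one-liner concrete, here is a minimal, self-contained sketch. The class and method names (`NfdStripSketch`, `stripToAscii`) are just placeholders I made up for illustration. NFD splits 'À' into 'A' plus a combining grave accent, and the regex then deletes the accent; a character with no canonical decomposition, such as '’', is simply deleted rather than substituted.

```java
import java.text.Normalizer;

// Hypothetical helper class, named only for this illustration.
class NfdStripSketch {

    // Decompose canonically (NFD), then delete everything that is not ASCII.
    static String stripToAscii(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // \u00C0 is 'À': it decomposes to 'A' + U+0300, and the accent is removed.
        System.out.println(stripToAscii("\u00C0 Tout Le Monde")); // prints "A Tout Le Monde"

        // \u2019 is '’': it has no decomposition, so it is dropped, not replaced by "'".
        System.out.println(stripToAscii("Guns N\u2019 Roses"));   // prints "Guns N Roses"
    }
}
```

So the accented letters come out right, but the apostrophe is lost instead of being turned into an ASCII one, which is exactly the part I am missing.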
and of course, I could trivially handle any other character with a number of hand-crafted .replaceAll() calls.
But I wonder whether there is some standard algorithm that does not require enumerating every remaining character I want to substitute. Is there even a name for what I want to do?
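For reference, the hand-crafted approach I would like to avoid looks roughly like the sketch below. The class name and the substitution table are hypothetical and cover only the characters from my example (this assumes Java 9+ for `Map.of`); every new character would need another entry, which is the enumeration I am hoping to get rid of.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical class, named only for this illustration.
class ManualAsciiSubstitution {

    // Hand-maintained table: covers only the characters in the example string.
    private static final Map<Character, String> SUBSTITUTES = Map.of(
            '\u2019', "'",    // right single quotation mark
            '\u00D7', "x",    // multiplication sign
            '\u2026', "...",  // horizontal ellipsis
            '\u2010', "-"     // hyphen
    );

    static String toAscii(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            String substitute = SUBSTITUTES.get(c);
            // Use the table entry if there is one; otherwise keep the character.
            sb.append(substitute != null ? substitute : String.valueOf(c));
        }
        // Remaining accented letters are handled by NFD + stripping non-ASCII;
        // anything else that is non-ASCII and not in the table is still deleted.
        return Normalizer.normalize(sb, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("Guns N\u2019 Roses, 2 \u00D7 4, \u00C0 Tout Le Monde"));
        // prints "Guns N' Roses, 2 x 4, A Tout Le Monde"
    }
}
```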