Explanation
First, in Java all text (String/Reader/Writer) is already Unicode. For String literals in Java source code, the editor and the javac compiler must use the same encoding, ideally UTF-8.
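The normalizer splits each accented character into its base letter and combining diacritical mark(s), and the regular expression then removes those marks. This converts accented text such as ä é ĉ to a e c; with a compatibility form such as NFKD, ligatures like fi fl also become fi fl. The ligature œ has no Unicode decomposition, so it is not turned into oe by normalization alone. The result is then converted to ASCII.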
Hence you would get, I think, "??? hello A".
Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);
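The two lines above rely on s1 and regex being defined elsewhere. As a minimal self-contained sketch of the same idea: the class name AccentStripper, the NFKD form, and the \p{InCombiningDiacriticalMarks} pattern are assumptions chosen for illustration, not necessarily what the original code used. Unicode escapes are used in the literals so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class AccentStripper {
    // Assumed pattern: one or more characters from the
    // "Combining Diacritical Marks" block (U+0300..U+036F).
    private static final String MARKS = "\\p{InCombiningDiacriticalMarks}+";

    /** Decomposes the text and drops the combining marks. */
    static String stripAccents(String s) {
        // NFKD splits base letters from their marks and also expands
        // compatibility ligatures such as U+FB01 (fi).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll(MARKS, "");
    }

    /** Strips accents, then round-trips through US-ASCII. */
    static String toAscii(String s) {
        Charset ascii = StandardCharsets.US_ASCII;
        // getBytes replaces every character ASCII cannot represent with '?'.
        return new String(stripAccents(s).getBytes(ascii), ascii);
    }

    public static void main(String[] args) {
        // a-umlaut, e-acute, fi ligature, c-circumflex
        System.out.println(toAscii("\u00e4 \u00e9 \ufb01 \u0109")); // a e fi c
        // Two CJK characters have no ASCII form and become question marks.
        System.out.println(toAscii("\u65e5\u672c hello"));          // ?? hello
    }
}
```

Characters without a decomposition, such as U+0153 (œ), pass through stripAccents unchanged and end up as ? after the ASCII round trip.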
To avoid the question marks (which cannot be told apart from a ? in the original string), you can use a CharsetEncoder obtained from Charset.newEncoder(), which can report unmappable characters instead of silently replacing them.
For ASCII you would still need some transliteration to Latin script, since text in other scripts has no ASCII equivalent even after normalization.
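The question marks are produced in the encoding step (getBytes), so one way to detect unmappable characters is a CharsetEncoder from Charset.newEncoder(). The following is only a sketch under that assumption; the class StrictAscii and its method names are hypothetical helpers introduced for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class StrictAscii {
    /** True if every character of s can be represented in US-ASCII. */
    static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    /**
     * Encodes s as US-ASCII without substituting '?'.
     * Returns an empty Optional if any character is unmappable.
     */
    static Optional<String> tryEncode(String s) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(s));
            return Optional.of(StandardCharsets.US_ASCII.decode(bytes).toString());
        } catch (CharacterCodingException e) {
            // Raised instead of writing a replacement '?'.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryEncode("hello?").isPresent());   // true
        System.out.println(tryEncode("\u00e4bc").isPresent()); // false
    }
}
```

With this approach a literal ? in the input survives unchanged, while text that would have been replaced is reported, so the caller can decide whether to transliterate, skip, or fail.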
Answer
Since most recent Linux distributions already use UTF-8 as the default system encoding, you can probably simply do:
System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);
Here s is converted to the operating system encoding.
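If the reported encoding turns out not to be UTF-8, one possible workaround (an assumption on my part, not something the answer above requires) is to wrap the output stream in a PrintStream with an explicit charset. The sketch below uses the PrintStream(OutputStream, boolean, Charset) constructor, which needs Java 10 or later; the class Utf8Printing and its methods are hypothetical names for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

final class Utf8Printing {
    /** Wraps a stream so that everything printed is encoded as UTF-8. */
    static PrintStream utf8(OutputStream target) {
        return new PrintStream(target, true, StandardCharsets.UTF_8);
    }

    /** Returns the bytes that printing s through a UTF-8 PrintStream produces. */
    static byte[] render(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = utf8(buffer)) {
            out.print(s);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("We are using encoding: "
                + System.getProperty("file.encoding"));
        // Always emits UTF-8 bytes, whatever the platform default is.
        utf8(System.out).println("\u00e4 \u00e9");
    }
}
```

Whether the terminal then displays those bytes correctly still depends on the terminal itself being set to UTF-8.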