Convert a Unicode string to ASCII in Java in a way that works on Unix/Linux

Question

I have already tried using Normalizer:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

I got this from http://stackoverflow.com/questions/15356716/how-can-i-convert-unicode-string-to-ascii-in-java – anshulkatta, Jun 26 '14 at 06:38

score 1 · Answer 1 · edited May 23 '17 at 12:20

There is an ASCII character class for matching code points in the ASCII set:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());

As Joop Eggen notes, Java String and char values are always UTF-16. ASCII-encoded data can exist only in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);
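
The snippets above can be combined into one small, self-contained program. This is only a sketch: the class name AsciiOnly and the method name toAscii are made up for illustration, and the sample strings are written with Unicode escapes so that the result does not depend on the encoding of the source file. Note that this approach deletes every character that has no ASCII decomposition (the CJK characters, or ß), rather than transliterating it.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative, not from the answer.
final class AsciiOnly {

    // Decompose with NFKD, then drop every code point outside the ASCII range.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    public static void main(String[] args) {
        // The question's sample string: three CJK characters, " hello ", A with diaeresis.
        String ascii = toAscii("\u53e3\u6c34\u96de hello \u00c4");
        System.out.println("[" + ascii + "]"); // prints [ hello A]

        // Every remaining char is ASCII, so encoding cannot produce '?' substitutes.
        byte[] bytes = ascii.getBytes(StandardCharsets.US_ASCII);
        System.out.println(bytes.length == ascii.length()); // prints true
    }
}
```

The leading space in the output is what remains of the separator after the CJK characters are removed; trim the result if that matters.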

answered Jun 26 '14 at 08:26

McDowell

_Short and simple._ The length comparison in general does not make sense. But it seems this is the answer. Any other problems are located elsewhere. However a conversion still might make sense, as some decoders may substitute special quotes (`“ ”`) and such by the ASCII quotes. – Joop Eggen Jun 26 '14 at 09:45

Joop Eggen · Answer 2 · 2014-06-30T06:25:55.430

0

Explanation

First, text in Java (String/Reader/Writer) is already Unicode. For Java source code (String literals), the editor and the javac compiler must use the same encoding, ideally UTF-8.

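The normalizer splits each character into a base letter and combining diacritical mark(s), and the regular expression then removes those marks. This converts accented text like ä é ﬁ ﬂ ĉ to the ASCII a e fi fl c. Characters that have no decomposition, such as ß, œ, ø or ł, are left unchanged.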
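
A minimal sketch of that two-step process, assuming NFKD and the combining-marks block from the question. The class name StripMarks is hypothetical. The pattern is passed to replaceAll directly: wrapping it in Pattern.quote would turn it into a literal string that never matches a combining mark. As Karol S points out in the comments, letters without a decomposition (ß, œ, ø, ł) pass through untouched.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class StripMarks {

    // Step 1: NFKD splits a letter into base letter + combining mark(s)
    //         and expands compatibility ligatures such as U+FB01 into "fi".
    // Step 2: the regex removes the combining marks.
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // a-diaeresis, e-acute, fi ligature, fl ligature, c-circumflex
        System.out.println(stripMarks("\u00e4 \u00e9 \ufb01 \ufb02 \u0109")); // prints a e fi fl c

        // sharp s, oe ligature, o-slash, l-stroke: no decomposition, so nothing changes
        String unchanged = "\u00df \u0153 \u00f8 \u0142";
        System.out.println(stripMarks(unchanged).equals(unchanged)); // prints true
    }
}
```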

Hence you would get, I think, "??? hello A": the three CJK characters have no ASCII equivalent, so getBytes replaces each of them with a ?.

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);

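To avoid the question marks (which cannot be distinguished from a ? in the original string), you can use Charset.newEncoder() for the String-to-bytes direction, or Charset.newDecoder() when reading bytes, and have it report unmappable characters instead of replacing them.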
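
One way to detect unmappable characters instead of silently getting ? is sketched below. For the String-to-bytes step the relevant class is CharsetEncoder, obtained from Charset.newEncoder(); a decoder plays the same role when turning bytes back into text. The class and method names (StrictAscii, encodeStrict, isAscii) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
final class StrictAscii {

    // Encode to ASCII, throwing instead of substituting '?' for unmappable characters.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Cheap check without producing any bytes.
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("hello?"));       // a real '?' is plain ASCII
        System.out.println(isAscii("hello \u00c4")); // A with diaeresis is not
        try {
            encodeStrict("\u53e3 hello");
        } catch (CharacterCodingException e) {
            System.out.println("not encodable as ASCII: " + e);
        }
    }
}
```

With this, a ? in the output can only come from a ? in the input.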

For ASCII you would still need some transliteration to Latin script.

Answer

Since most recent Linux distributions already use UTF-8 as the default operating system encoding, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding when it is printed.
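
If the text comes from a file rather than from a String literal, the charset used to decode the bytes matters as well. The sketch below assumes the file content is UTF-8, which the question does not state, and uses a byte array as a stand-in for the file; with a real file, Files.newBufferedReader(path, StandardCharsets.UTF_8) takes the charset in the same way. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class DecodeDemo {

    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Same folding idea as in the answers: decompose, then drop combining marks.
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // Stand-in for file content: PROJE, c-cedilla, a-tilde, O, stored as UTF-8.
        byte[] fileBytes = "PROJE\u00e7\u00e3O".getBytes(StandardCharsets.UTF_8);

        System.out.println("default charset: " + Charset.defaultCharset());

        // Correct charset: the accents survive decoding and can be folded away.
        System.out.println(fold(decode(fileBytes, StandardCharsets.UTF_8)));

        // Wrong charset: the accented letters are lost before folding even starts.
        String wrong = fold(decode(fileBytes, StandardCharsets.US_ASCII));
        System.out.println(wrong.equals("PROJEcaO"));
    }
}
```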

edited Jun 30 '14 at 06:25

answered Jun 26 '14 at 07:24

Joop Eggen

Yes, exactly. I want to prevent the ? character; I want PROJEçãO to be printed as PROJECAO. I am able to convert it when I type PROJEçãO into a String literal, but when I read it from a file it prints PROJE??? – anshulkatta Jun 26 '14 at 07:40
"The regular expression converts text like ä é ß ĉ œ to a e ss c oe, with accents, to ASCII." What do you mean by that? Normalizer does not split ligatures (ß or œ) nor many letters with diacritics (ø or ł). – Karol S Jun 28 '14 at 16:43
@KarolS yes, bad formulation (partly corrected); it was intended to clarify in case that was not clear. I did not find the ff or fi ligatures in the character map; I wanted them because using string length is not so good an idea. – Joop Eggen Jun 28 '14 at 16:57
`ﬀ` is \uFB00, and this ligature is normalized to `ff`. But still, the normalizer doesn't do anything to ß. `Normalizer.normalize("ß", Normalizer.Form.WHICHEVER)` returns `"ß"`, not `"ss"`, and the regex does nothing afterwards. – Karol S Jun 28 '14 at 20:25
