I have already tried using Normalizer:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String regex = Pattern.quote("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

String s2 = new String(s1.replaceAll(regex, "").getBytes("ascii"), "ascii");

System.out.println(s2);
System.out.println(s.length() == s2.length());

I want it to work on Unix/Linux.

anshulkatta

2 Answers

There is an ASCII character class for matching code points in the ASCII set:

String s = "口水雞 hello Ä";

String s1 = Normalizer.normalize(s, Normalizer.Form.NFKD);
String nonAscii = "[^\\p{ASCII}]+";
String s2 = s1.replaceAll(nonAscii, "");

System.out.println(s2);
System.out.println(s.length() == s2.length());

As Joop Eggen notes, Java String and char values are always UTF-16. You can only have ASCII-encoded data in byte form:

byte[] ascii = s2.getBytes(StandardCharsets.US_ASCII);
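Put together, the approach above can be sketched as a small self-contained program. The class name `AsciiFold` and the helper `toAscii` are hypothetical names chosen for illustration, and the string literals use Unicode escapes so the result does not depend on the source file encoding:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class AsciiFold {
    // Hypothetical helper: decompose with NFKD, then drop every
    // code point that is not in the ASCII range.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("[^\\p{ASCII}]+", "");
    }

    public static void main(String[] args) {
        // "PROJEçãO" written with escapes for ç (U+00E7) and ã (U+00E3)
        String folded = toAscii("PROJE\u00E7\u00E3O");
        System.out.println(folded); // prints PROJEcaO

        // Only the byte form is actually ASCII-encoded
        byte[] ascii = folded.getBytes(StandardCharsets.US_ASCII);
        System.out.println(ascii.length); // prints 8
    }
}
```

Note that characters with no ASCII base letter, such as the CJK characters in the question, are removed entirely rather than transliterated.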
McDowell
  • _Short and simple._ The length comparison in general does not make sense. But it seems this is the answer. Any other problems are located elsewhere. However a conversion still might make sense, as some decoders may substitute special quotes (`“ ”`) and such by the ASCII quotes. – Joop Eggen Jun 26 '14 at 09:45
Explanation

First, in Java, text (String/Reader/Writer) is already Unicode. For Java source code (string literals), the editor and the javac compiler should use the same encoding, ideally UTF-8.

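The normalizer splits each character into a base letter and combining diacritical mark(s), and the regular expression removes those marks. This converts accented text like ä é ĉ to a e c, and NFKD also expands compatibility ligatures such as ﬁ ﬂ to fi fl. Characters that have no decomposition, such as œ, ß, ø or ł, pass through unchanged and are not ASCII afterwards.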
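As a rough illustration of what the normalizer does and does not decompose, here is a minimal sketch. The class `DecompositionDemo` and the helper `stripMarks` are hypothetical names, and the inputs are written as Unicode escapes. As pointed out in the comments below, ß and œ have no decomposition, so they come back unchanged:

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Hypothetical helper: NFKD-decompose, then remove combining
    // diacritical marks (U+0300 to U+036F).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // ä é ĉ: each splits into base letter + combining mark
        System.out.println(stripMarks("\u00E4\u00E9\u0109")); // prints aec

        // ﬁ ﬂ ligatures: compatibility decomposition to plain letters
        System.out.println(stripMarks("\uFB01\uFB02")); // prints fifl

        // ß œ: no decomposition, so both survive untouched
        System.out.println(stripMarks("\u00DF\u0153").equals("\u00DF\u0153")); // prints true
    }
}
```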

Hence you would get - I think - "??? hello A".

Charset ascii = StandardCharsets.US_ASCII;
String s2 = new String(s1.replaceAll(regex, "").getBytes(ascii), ascii);

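To avoid receiving the question marks (and to tell them apart from a ? in the original string), you can use a CharsetEncoder obtained from Charset.newEncoder(), which can report unmappable characters instead of substituting them. When reading bytes into text, Charset.newDecoder() plays the same role.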
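A minimal sketch of that idea for the String-to-bytes direction, where the relevant class is `CharsetEncoder` (from `Charset.newEncoder()`). The class `StrictAscii` and its two helpers are hypothetical names used only for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAscii {
    // Hypothetical helper: encode to ASCII and throw on any character
    // that cannot be represented, instead of silently writing '?'.
    static byte[] encodeStrict(String input) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(input));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // Hypothetical helper: cheap up-front check without encoding.
    static boolean isPureAscii(String input) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(input);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("hello?"));      // a literal ? is fine
        System.out.println(isPureAscii("hello \u00C4")); // Ä is not ASCII
        try {
            encodeStrict("\u53E3 hello");
        } catch (CharacterCodingException e) {
            System.out.println("unmappable character reported: " + e);
        }
    }
}
```

With this, a question mark in the output can only come from the original text, because unmappable input fails loudly rather than being replaced.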

For ASCII you would still need some transliteration to Latin script.

Answer

As most recent Linux distributions already use UTF-8 as the default system encoding, you can probably simply do:

System.out.println("We are using encoding: " + System.getProperty("file.encoding"));
System.out.println(s);

Here s is converted to the operating system encoding.
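The same default-encoding question applies when the text is read from a file, which is the case raised in the comments below. The following is only a sketch under the assumption that the file is stored as UTF-8; the class `ReadWithCharset` and the helper `read` are hypothetical names. It writes a temporary file and decodes it twice to show why naming the charset explicitly matters:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWithCharset {
    // Hypothetical helper: decode a whole file with an explicitly
    // named charset instead of relying on the platform default.
    static String read(Path file, Charset charset) throws IOException {
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        String original = "PROJE\u00E7\u00E3O"; // PROJEçãO
        Path file = Files.createTempFile("encoding-demo", ".txt");
        try {
            Files.write(file, original.getBytes(StandardCharsets.UTF_8));

            // Matching charset: the text round-trips intact
            System.out.println(read(file, StandardCharsets.UTF_8).equals(original));

            // Wrong charset: the non-ASCII bytes decode to U+FFFD
            // replacement characters, which a console may show as '?'
            System.out.println(read(file, StandardCharsets.US_ASCII).equals(original));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Once the file has been decoded with the right charset, the Normalizer-based conversion behaves the same as it does for a string literal.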

Joop Eggen
  • Yes, exactly. I want to prevent the ? character; I want PROJEçãO to be printed as PROJECAO. I am able to convert it when I type PROJEçãO into a String object, but when I read it from a file it prints as PROJE??? – anshulkatta Jun 26 '14 at 07:40
  • "The regular expression converts text like ä é ß ĉ œ to a e ss c oe, with accents, to ASCII." What do you mean by that? Normalizer does not split ligatures (ß or œ) nor many letters with diacritics (ø or ł). – Karol S Jun 28 '14 at 16:43
  • @KarolS yes, bad formulation (partly corrected); intended to clarify should it not be clear. I did not find in the character map the ff or fi ligatures; as using string length is not so good an idea. – Joop Eggen Jun 28 '14 at 16:57
  • `ﬀ` is \uFB00, and this ligature is normalized to `ff`. But still, normalizer doesn't do anything to ß. `Normalizer("ß", Normalizer.Form.WHICHEVER)` returns `"ß"`, not `"ss"`, and the regex does nothing afterwards. – Karol S Jun 28 '14 at 20:25