When I convert a UTF-8 string containing characters that do not exist in ISO 8859-1 to ISO 8859-1, I get question marks here and there. Sure, what else should it do!

Is there a Java tool that can map a string like "İKEA" to "IKEA", avoiding the ? and making the best of it?

Hasan Tuncay
  • http://stackoverflow.com/questions/285228/how-to-convert-utf-8-to-us-ascii-in-java – kodmanyagha May 15 '13 at 14:57
  • @Hasan Sorry I erroneously voted to close, after re-reading upvoted your question. – stacker May 15 '13 at 15:41
  • This question is not a dup! The suggested solutions work for US-ASCII only, but ISO 8859-1 also contains several letters like ÄÖÜ, which should be distinguished from İ (representable in UTF-8 but not in ISO 8859-1) – stacker May 16 '13 at 12:20

1 Answer

For the specific example, you can:

  • decompose the letters and diacritics using the compatibility decomposition form of Unicode normalization (NFKD)
  • instruct the encoder to drop unsupported characters (the diacritics)

Example:

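import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// create an encoder that silently drops characters ISO-8859-1 cannot represent
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// write data: NFKD splits U+0130 into 'I' plus the combining dot above (U+0307)
String ikea = "\u0130KEA";
String decomposed = Normalizer.normalize(ikea, Form.NFKD);
CharBuffer cbuf = CharBuffer.wrap(decomposed);
ByteBuffer bbuf = encoder.encode(cbuf); // throws CharacterCodingException
// write only the encoded bytes; the backing array can be longer than the output
out.write(bbuf.array(), bbuf.arrayOffset() + bbuf.position(), bbuf.remaining());
// verify
String decoded = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
System.out.println(decoded); // IKEA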

You're still transcoding from a character set that defines 109,384 values (Unicode 6) to one that supports 256, so there will always be limitations.
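
One such limitation: normalizing the whole string with NFKD also decomposes letters that ISO 8859-1 does support, so "Ä" would come out as "A". A possible way around that is to decompose only the characters the encoder cannot represent. The following is a minimal sketch under that assumption; the class `Latin1Fallback` and the method `toLatin1Friendly` are hypothetical names, not part of the JDK, and characters without a usable decomposition (for example "ł") are simply dropped.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper for illustration only; not a JDK or library API.
final class Latin1Fallback {

    // Keeps every character ISO-8859-1 can encode and decomposes only the rest.
    static String toLatin1Friendly(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        // Compose first so that "A" + combining diaeresis becomes a single "Ä".
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder result = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); ) {
            int cp = composed.codePointAt(i);
            i += Character.charCount(cp);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                result.append(ch); // already representable, keep as-is
                continue;
            }
            // Not representable: decompose and keep only the encodable parts,
            // which drops combining marks such as U+0307.
            String parts = Normalizer.normalize(ch, Normalizer.Form.NFKD);
            for (int j = 0; j < parts.length(); j++) {
                char part = parts.charAt(j);
                if (encoder.canEncode(part)) {
                    result.append(part);
                }
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The returned string is "IKEA ÄÖÜ"; how it prints depends on the console encoding.
        System.out.println(toLatin1Friendly("\u0130KEA \u00C4\u00D6\u00DC"));
    }
}
```

The result contains only characters that ISO 8859-1 can encode, so a subsequent `getBytes(StandardCharsets.ISO_8859_1)` should not produce question marks.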

Also consider a more sophisticated transformation API like ICU for features like transliteration.

McDowell