UTF-8 -- ISO 8859-1 mapping tool

Question

When I convert a UTF-8 string containing characters that do not exist in ISO 8859-1 to ISO 8859-1, I get question marks here and there. Fair enough, what else should the encoder do?

Is there a Java tool that can map a string like "İKEA" to "IKEA", avoiding the ? and making the best of it?

http://stackoverflow.com/questions/285228/how-to-convert-utf-8-to-us-ascii-in-java — kodmanyagha, May 15 '13 at 14:57
@Hasan Sorry I erroneously voted to close, after re-reading upvoted your question. — stacker, May 15 '13 at 15:41
This question is not a dup! The suggested solutions work for US-ASCII only, but ISO 8859-1 also contains several letters like ÄÖÜ, which should be distinguished from İ (contained in UTF-8 but not in ISO 8859-1) — stacker, May 16 '13 at 12:20

score 1 · Answer 1 · answered May 16 '13 at 15:24

For the specific example, you can:

decompose the letters and diacritics using compatibility form Unicode normalization
instruct the encoder to drop unsupported characters (the diacritics)

Example:

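import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.text.Normalizer.Form;

ByteArrayOutputStream out = new ByteArrayOutputStream();
// create an encoder that silently drops characters ISO 8859-1 cannot represent
CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// write data: NFKD splits U+0130 into 'I' plus a combining dot above (U+0307)
String ikea = "\u0130KEA";
String decomposed = Normalizer.normalize(ikea, Form.NFKD);
CharBuffer cbuf = CharBuffer.wrap(decomposed);
ByteBuffer bbuf = encoder.encode(cbuf); // may throw CharacterCodingException
// copy only the encoded bytes; the backing array can be longer than the content
out.write(bbuf.array(), bbuf.arrayOffset() + bbuf.position(), bbuf.remaining());
// verify
String decoded = new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
System.out.println(decoded);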

You're still transcoding from a character set that defines 109,384 values (Unicode 6) to one that supports 256, so there will always be limitations.
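One limitation of normalizing the whole string is that letters ISO 8859-1 does support, such as ÄÖÜ, are decomposed too and lose their diacritics. A possible workaround is to decompose only the characters the encoder cannot represent. The following is a minimal sketch under that assumption, not a definitive implementation; the class name Latin1Mapper and the method toLatin1 are hypothetical names introduced for illustration.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper, not part of the original answer.
class Latin1Mapper {

    // Keeps every character ISO 8859-1 can encode as-is and decomposes only the rest.
    static String toLatin1(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                // Already representable (for example an A with diaeresis): keep it unchanged.
                result.append(ch);
                continue;
            }
            // Not representable: split into base letter plus combining marks.
            String decomposed = Normalizer.normalize(ch, Normalizer.Form.NFKD);
            for (int j = 0; j < decomposed.length(); j++) {
                char part = decomposed.charAt(j);
                // Keep the parts that fit; unencodable parts (the diacritics) are dropped.
                if (encoder.canEncode(part)) {
                    result.append(part);
                }
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String mapped = toLatin1("\u0130KEA \u00C4\u00D6\u00DC");
        System.out.println(mapped.equals("IKEA \u00C4\u00D6\u00DC"));
    }
}
```

Characters without a Unicode decomposition, such as Ł (U+0141), are still dropped entirely by this sketch, which is the kind of case where a transliteration library helps.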

Also consider a more sophisticated transformation API like ICU for features like transliteration.
