Convert combining diaereses to ISO 8859-1

Question

This is similar to this question, but I specifically need to know how to convert to ISO-8859-1 format, not UTF-8.

Short question: I need a character with a combining diaeresis converted to its Latin-1 equivalent (if one exists).

Longer question: I have German strings that contain combining diaereses (Unicode code point U+0308, UTF-8 bytes [cc][88]), but my database only supports ISO-8859-1 (i.e. Latin-1). Because these characters are "decomposed", I can't just "convert" to ISO-8859-1: the byte sequence [cc][88] modifies the preceding character, and the resulting combination may not have a corresponding character in ISO-8859-1.

I tried this code:

import java.nio.charset.Charset;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// "u" followed by U+0308 (combining diaeresis)
String s = "fu\u0308r";
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");

ByteBuffer inputBuffer = ByteBuffer.wrap(s.getBytes(utf8charset));

// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);

// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
byte[] outputData = outputBuffer.array();

String isoString = new String(outputData, iso88591charset);

// isoString is "fu?r"

But it just fails to encode the combining diaeresis rather than recognizing that "u" followed by U+0308 could be converted to U+00FC (UTF-8: [c3][bc]). Is there a library that can detect when a character followed by a combining diaeresis maps to an existing ISO-8859-1 character? (Preferably in Java)

score 3 · Accepted Answer · answered Sep 23 '14 at 21:38

You need to normalize before you encode.

Use the Normalizer class to convert to a composed form (NFC) and then encode.
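As a minimal sketch of that idea (the class name `Latin1Compose` and the helper `toLatin1Bytes` are made up for illustration, not part of any library): NFC normalization composes "u" followed by U+0308 into the single character U+00FC, which ISO-8859-1 can then encode as one byte.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class Latin1Compose {
    // Hypothetical helper: compose combining sequences (NFC), then encode as ISO-8859-1.
    static byte[] toLatin1Bytes(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return composed.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String decomposed = "fu\u0308r"; // 'u' followed by U+0308 COMBINING DIAERESIS
        byte[] bytes = toLatin1Bytes(decomposed);
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);
        }
        System.out.println(); // the bytes printed are 66 fc 72
    }
}
```

Note that `String.getBytes(Charset)` silently substitutes `?` for anything the charset cannot represent, so this sketch assumes that lossy behavior is acceptable for characters with no Latin-1 equivalent.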

bmargulies

Upvoted, because that worked. I'm going to post my code as another answer to give a more explicit example. – Devin Sep 23 '14 at 23:52

score 1 · Answer 2 · answered Sep 23 '14 at 23:58

Expanding on bmargulies' answer: normalizing was the key.

Here is the code that worked:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

private static String encodeToLatin1(byte[] input) {
    // Decode the incoming bytes as UTF-8.
    CharBuffer data = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(input));

    // NFC composes "u" + U+0308 into the single character U+00FC,
    // which ISO-8859-1 can represent.
    String composed = Normalizer.normalize(data, Normalizer.Form.NFC);
    ByteBuffer outputBuffer = StandardCharsets.ISO_8859_1.encode(composed);

    // Read only the encoded bytes; the backing array may be larger than the content.
    return new String(outputBuffer.array(),
            outputBuffer.arrayOffset() + outputBuffer.position(),
            outputBuffer.remaining(),
            StandardCharsets.ISO_8859_1);
}

// One precomposed umlaut ("ö") and one written as "u" + combining diaeresis
String s = "L\u00f6sung fu\u0308r";

// Returns "Lösung für", with both umlauts now precomposed
encodeToLatin1(s.getBytes(StandardCharsets.UTF_8));
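Normalizing only helps when a precomposed Latin-1 character exists; anything else is still replaced with "?" during encoding. One possible way to detect that case up front, sketched here with a hypothetical helper `fitsInLatin1` (the name and class are illustrative only), is to ask a `CharsetEncoder` whether it can encode the normalized string:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class Latin1Check {
    // Hypothetical helper: true if the NFC-normalized text is fully representable in ISO-8859-1.
    static boolean fitsInLatin1(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(composed);
    }

    public static void main(String[] args) {
        // "u" + U+0308 composes to U+00FC, which Latin-1 contains.
        System.out.println(fitsInLatin1("fu\u0308r"));
        // "q" + U+0308 has no precomposed form, so U+0308 remains and cannot be encoded.
        System.out.println(fitsInLatin1("q\u0308"));
    }
}
```

With a check like this, the caller can decide whether to reject the string, strip the leftover combining marks, or accept the "?" substitution.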
