I've been trying to use the Java SAX parser to parse an XML file in the ISO-8859-1 character encoding. Parsing otherwise works, but special characters such as ä and ö give me a headache: the ContentHandler.characters(...) method hands me mangled characters, and String doesn't even have a constructor that takes a char array together with a character encoding.
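To be concrete about that last point, here is a small sketch (the byte values match the hexdump below): String offers a byte-array constructor with a charset parameter, but no char-array equivalent, because chars are supposed to be decoded already.

```java
import java.nio.charset.StandardCharsets;

public class StringCtors {
    public static void main(String[] args) {
        byte[] bytes = {0x4d, 0x6f, 0x74, (byte) 0xf6};  // "Motö" in ISO-8859-1
        // This constructor exists and decodes correctly:
        String fromBytes = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(fromBytes);  // prints Motö

        char[] chars = {'M', 'o', 't', 'ö'};
        // new String(chars, StandardCharsets.ISO_8859_1);  // does not compile: no such constructor
        String fromChars = new String(chars);  // no charset applies to a char[]
        System.out.println(fromChars);
    }
}
```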
Here's a complete minimum working example in two files:
latin1.xml:
<?xml version='1.0' encoding='ISO-8859-1' standalone='no' ?>
<x>Motörhead</x>
The file really is saved in Latin-1, as hexdump confirms:
$ hexdump -C latin1.xml
00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 27 31 |<?xml version='1|
00000010 2e 30 27 20 65 6e 63 6f 64 69 6e 67 3d 27 49 53 |.0' encoding='IS|
00000020 4f 2d 38 38 35 39 2d 31 27 20 73 74 61 6e 64 61 |O-8859-1' standa|
00000030 6c 6f 6e 65 3d 27 6e 6f 27 20 3f 3e 0a 3c 78 3e |lone='no' ?>.<x>|
00000040 4d 6f 74 f6 72 68 65 61 64 3c 2f 78 3e |Mot.rhead</x>|
So the "ö" is encoded with a single byte, f6, as you'd expect.
Here's the Java file, saved in UTF-8:
MySAXHandler.java:
import java.io.File;
import java.io.FileReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
public class MySAXHandler extends DefaultHandler {

    private static final String FILE = "latin1.xml"; // Edit this to point to the correct file

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] dstCharArray = new char[length];
        System.arraycopy(ch, start, dstCharArray, 0, length);
        String strValue = new String(dstCharArray);
        System.out.println("Read: '" + strValue + "'");
        assert("Motörhead".equals(strValue));
    }

    private XMLReader getXMLReader() {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader xmlReader = saxParser.getXMLReader();
            xmlReader.setContentHandler(new MySAXHandler());
            return xmlReader;
        } catch (Exception ex) {
            throw new RuntimeException("Epic fail.", ex);
        }
    }

    public void go() {
        try {
            XMLReader reader = getXMLReader();
            reader.parse(new InputSource(new FileReader(new File(FILE))));
        } catch (Exception ex) {
            throw new RuntimeException("The most epic fail.", ex);
        }
    }

    public static void main(String[] args) {
        MySAXHandler tester = new MySAXHandler();
        tester.go();
    }
}
Running this program outputs Read: 'Mot�rhead'
(ö replaced with a "? in a box") and then crashes with an assertion error. If you inspect the char array, the letter ö has been replaced by a single strange character, and encoding that character back to bytes as UTF-8 yields three bytes. That makes no sense to me, since in UTF-8 an ö should take only two bytes.
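This is how I looked at those bytes (a sketch; in my debugger the offending char shows up as \uFFFD, so I hard-code it here):

```java
import java.nio.charset.StandardCharsets;

public class ByteDump {
    public static void main(String[] args) {
        // The single char the parser delivers where ö should be:
        char bad = '\uFFFD';
        // Encode it back to UTF-8 and dump the bytes in hex:
        for (byte b : String.valueOf(bad).getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02x ", b & 0xff);  // prints: ef bf bd
        }
        System.out.println();
    }
}
```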
What I have tried
I have tried converting the character array to a String and then feeding that String's bytes back to the String constructor that takes a charset parameter. I have also played with CharBuffers and looked for something in the Locale class that might help, but nothing I try works.
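Here is a sketch of the re-encoding attempt, with the contents of the char array hard-coded to what characters(...) actually hands me (the charset choices are just the combinations I tried):

```java
import java.nio.charset.StandardCharsets;

public class ReencodeAttempt {
    public static void main(String[] args) {
        // What characters(...) delivers: ö already replaced by a strange char
        char[] ch = {'M', 'o', 't', '\uFFFD', 'r', 'h', 'e', 'a', 'd'};
        String strValue = new String(ch);
        // Round-trip through bytes with explicit charsets:
        byte[] bytes = strValue.getBytes(StandardCharsets.UTF_8);
        String reEncoded = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(reEncoded);  // prints Motï¿½rhead, still not Motörhead
    }
}
```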