I'm trying to parse this XML document with a SAX parser:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE WIN_TPBOUND_MESSAGES SYSTEM "tpbound_messages_v1.dtd">
<WIN_TPBOUND_MESSAGES>
    <SMSTOTP>
        <SOURCE_ADDR>+447522579247</SOURCE_ADDR>
        <TEXT>TEST: @£$¥èéùìò?ØøÅå&amp; ^{}\\[~]¡&#8364;ÆæßÉ!\"#¤%'()*+,-./0123456789:;&lt;=&gt;? ÄÖÑܧ¿äöñüà end</TEXT>
        <WINTRANSACTIONID>652193268</WINTRANSACTIONID>
    </SMSTOTP>
</WIN_TPBOUND_MESSAGES>

After parsing the <TEXT> element, the content is converted to:

TEST: @£$¥èéùìò?Ã�øÃ�Ã¥& ^{}\\[~]¡€Ã�æÃ�Ã�!\"#¤%'()*+,-./0123456789:;<=>? Ã�Ã�Ã�Ã�§¿äöñüà end

So clearly something bad is happening to the non-ASCII characters. The code that parses the XML is shown below:

public void parse(InputStream xmlStream) throws WinGatewayException {
    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(this);
    parser.setErrorHandler(error);
    parser.setEntityResolver(new DTDResolver());
    parser.setDTDHandler(this);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", true);
    parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    parser.setFeature("http://apache.org/xml/features/continue-after-fatal-error", false);
    parser.parse(new InputSource(xmlStream));
}

and the object referred to by `this` has methods such as:

public void endElement(String uri, String localName, String qName)
        throws SAXException {

    if (localName.equals("TEXT")) {
        logger.debug("Parsed message text: " + cData.toString());
        message.setText(cData.toString());
    }
}

Why aren't these non-ASCII characters being preserved by the XML parser?

Dónal
  • Depends what `xmlStream` is. Is it a `Reader` or an `InputStream`? Also, what is `cData`? – artbristol Jun 21 '12 at 10:47
  • Please try enclosing the non-ASCII characters in a CDATA section. [Check here](http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean) – Santosh Jun 21 '12 at 10:48
  • @artbristol I've updated the code to show that it's an `InputStream` – Dónal Jun 21 '12 at 10:54
  • Are those characters *genuinely* ISO-8859-1 encoded? What are the bytes involved for the "£" sign for example? – Jon Skeet Jun 21 '12 at 10:58

1 Answer

I believe your XML file is actually in UTF-8 rather than ISO-8859-1.

An ISO-8859-1-encoded file would have a single byte per character, so the UK pound sign would be a single byte 0xA3. However, it looks like your file has 0xC2 0xA3, which is the byte sequence you'd get for U+00A3 in UTF-8.
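To make the byte difference concrete, here is a minimal sketch that encodes U+00A3 both ways and prints the bytes in hex. The class name `PoundSignBytes` and the helper `toHex` are made up for this illustration and are not part of your code.

```java
import java.nio.charset.StandardCharsets;

public class PoundSignBytes {

    // Hypothetical helper: formats each byte as 0xNN, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the UK pound sign

        // One byte per character in ISO-8859-1
        System.out.println("ISO-8859-1: "
                + toHex(pound.getBytes(StandardCharsets.ISO_8859_1))); // ISO-8859-1: 0xA3

        // Two bytes for the same character in UTF-8
        System.out.println("UTF-8: "
                + toHex(pound.getBytes(StandardCharsets.UTF_8))); // UTF-8: 0xC2 0xA3
    }
}
```

Running the same kind of hex dump over the first bytes of the real file should show which of the two encodings it actually uses.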

Change the XML declaration to reflect this:

<?xml version="1.0" encoding="UTF-8"?>

and see if that fixes things. Assuming it does, you then need to work out what's produced this bad data to start with.
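The effect of a wrong declaration can be reproduced in isolation. The sketch below is an assumption-laden illustration, not your setup: it uses the JDK's default SAX parser via `SAXParserFactory` rather than the explicit Xerces class, drops the DTD and validation, and parses a tiny made-up document whose bytes are always UTF-8 while the declared encoding varies. The names `DeclaredEncodingDemo` and `parseText` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DeclaredEncodingDemo {

    // Parses a small document that is always written out as UTF-8 bytes,
    // but whose XML declaration claims the given encoding.
    static String parseText(String declaredEncoding) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<TEXT>\u00A3\u00E9</TEXT>"; // pound sign, e-acute
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        // No encoding is set on the InputSource, so the parser has to rely
        // on the XML declaration to decode the byte stream.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(utf8Bytes)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Wrong declaration: each UTF-8 byte is decoded as one Latin-1 character.
        System.out.println(parseText("ISO-8859-1")
                .equals("\u00C2\u00A3\u00C3\u00A9")); // true
        // Correct declaration: the original two characters come back.
        System.out.println(parseText("UTF-8").equals("\u00A3\u00E9")); // true
    }
}
```

The comparison uses Unicode escapes instead of printing the text, so the result does not depend on how the console itself decodes the output.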

Jon Skeet
  • That's one possibility. Another possibility is that the input file is OK, and indeed that the data being logged is OK, but the content of the log is being displayed incorrectly: perhaps the log file contains UTF-8 encoded characters, but is being displayed as if it were ISO-8859-1. – Michael Kay Jun 21 '12 at 11:49
  • @MichaelKay: That's possible, but very unlikely given the input data. The chance of all those "Â" characters just *happening* to be part of useful UTF-8 characters is pretty small, I'd say. – Jon Skeet Jun 21 '12 at 11:50