Parsing non-ASCII characters in an XML document

Question

I'm trying to parse this XML document with a SAX parser:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE WIN_TPBOUND_MESSAGES SYSTEM "tpbound_messages_v1.dtd">
<WIN_TPBOUND_MESSAGES>
    <SMSTOTP>
        <SOURCE_ADDR>+447522579247</SOURCE_ADDR>
        <TEXT>TEST: @£$¥èéùìò?ØøÅå&amp; ^{}\\[~]¡&#8364;ÆæßÉ!\"#¤%'()*+,-./0123456789:;&lt;=&gt;? ÄÖÑÜ§¿äöñüà end</TEXT>
        <WINTRANSACTIONID>652193268</WINTRANSACTIONID>
    </SMSTOTP>
</WIN_TPBOUND_MESSAGES>

After parsing the <TEXT> element, the content is converted to:

TEST: @Â£$Â¥Ã¨Ã©Ã¹Ã¬Ã²?Ã�Ã¸Ã�Ã¥& ^{}\\[~]Â¡€Ã�Ã¦Ã�Ã�!\"#Â¤%'()*+,-./0123456789:;<=>? Ã�Ã�Ã�Ã�Â§Â¿Ã¤Ã¶Ã±Ã¼Ã  end

So clearly something bad is happening to the non-ASCII characters. The code that parses the XML is shown below:

public void parse(InputStream xmlStream) throws WinGatewayException {
    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(this);
    parser.setErrorHandler(error);
    parser.setEntityResolver(new DTDResolver());
    parser.setDTDHandler(this);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", true);
    parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    parser.setFeature("http://apache.org/xml/features/continue-after-fatal-error", false);
    parser.parse(new InputSource(xmlStream));
}

The object referred to by `this` has methods such as:

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equals("TEXT")) {
        logger.debug("Parsed message text: " + cData.toString());
        message.setText(cData.toString());
    }
}

Why aren't these non-ASCII characters being preserved by the XML parser?

It depends on what `xmlStream` is. Is it a `Reader` or an `InputStream`? Also, what is `cData`? – artbristol Jun 21 '12 at 10:47
Please try enclosing the non-ASCII characters in a CDATA section. [Check here](http://stackoverflow.com/questions/2784183/what-does-cdata-in-xml-mean) – Santosh Jun 21 '12 at 10:48
@artbristol I've updated the code to show that it's an `InputStream` – Dónal Jun 21 '12 at 10:54
Are those characters *genuinely* ISO-8859-1 encoded? What are the bytes involved for the "£" sign, for example? – Jon Skeet Jun 21 '12 at 10:58

score 3 · Accepted Answer · answered Jun 21 '12 at 11:00

I believe your XML file is actually in UTF-8 rather than ISO-8859-1.

An ISO-8859-1-encoded file would have a single byte per character, so the UK pound sign would be a single byte 0xA3. However, it looks like your file has 0xC2 0xA3, which is the byte sequence you'd get for U+00A3 in UTF-8.
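As a rough illustration of that byte-level difference, here is a minimal sketch (the class name `PoundSignBytes` and the `hex` helper are made up for this example; they are not part of the code in the question). It encodes U+00A3 both ways and then decodes the UTF-8 bytes as ISO-8859-1, which is what a parser does when it trusts a wrong declaration:

```java
import java.nio.charset.StandardCharsets;

public class PoundSignBytes {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // the pound sign

        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println("ISO-8859-1: " + hex(pound.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println("UTF-8:      " + hex(pound.getBytes(StandardCharsets.UTF_8)));

        // Reading the UTF-8 bytes as ISO-8859-1 turns one character into two:
        // U+00C2 (A with circumflex) followed by U+00A3 (the pound sign).
        String misread = new String(pound.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println("Misread length: " + misread.length());
        System.out.println("Is mojibake:    " + misread.equals("\u00C2\u00A3"));
    }
}
```

The "Â£" at the start of your parsed text matches that two-character result, which is why I suspect the bytes on disk are UTF-8.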

Change the XML declaration to reflect this:

<?xml version="1.0" encoding="UTF-8"?>

and see if that fixes things. Assuming it does, you then need to work out what's produced this bad data to start with.
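If you want to check the theory without touching the real file, a small sketch along these lines should reproduce the symptom. Note the assumptions: it uses the JDK's default `SAXParserFactory` rather than the Xerces class named in the question, a cut-down document with only a `<TEXT>` element and no DTD, and a made-up class name `DeclaredEncodingDemo`. The bytes are always UTF-8; only the declared encoding changes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class DeclaredEncodingDemo {
    // Builds a tiny document, writes it as UTF-8 bytes, and parses it
    // with whatever encoding the XML declaration claims.
    static String parseText(String declaredEncoding) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<TEXT>\u00A3</TEXT>";
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called more than once per element.
                text.append(ch, start, length);
            }
        };

        // No encoding is set on the stream, so the parser relies on the declaration.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(utf8Bytes), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Wrong declaration: the pound sign comes back as two characters.
        System.out.println(parseText("ISO-8859-1").equals("\u00C2\u00A3"));
        // Matching declaration: the pound sign survives intact.
        System.out.println(parseText("UTF-8").equals("\u00A3"));
    }
}
```

Comparing against escaped code points rather than printing the characters avoids the console's own encoding muddying the result.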

Jon Skeet

That's one possibility. Another possibility is that the input file is OK, and indeed that the data being logged is OK, but the content of the log is being displayed incorrectly: perhaps the log file contains UTF-8 characters but is being displayed as if it contained ISO-8859-1. – Michael Kay Jun 21 '12 at 11:49
@MichaelKay: That's possible, but very unlikely given the input data. The chances of all those "Â" characters just *happening* to be part of useful UTF-8 characters is pretty small, I'd say. – Jon Skeet Jun 21 '12 at 11:50
