
I am wondering why my SAX parser does not seem to resolve certain entities defined in an external DTD file. I am processing a huge XML file with the following header. The input (heavily reduced :-)) is:

// myxml.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE authors SYSTEM "mydtd.dtd">
<authors>
    <author>
        Bal&aacute;zs
    </author>
</authors>

And this is the incorrect output:

Bal
?zs

Obviously &aacute; was not resolved!

This is how I have set up the parser:

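// MySaxParser.java

import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySaxParser extends DefaultHandler {

private String currentTag; // local name of the element currently being read

@Override
public void startElement(String uri, String localName, String qName, Attributes atts)
        throws SAXException {
    currentTag = localName;
}

@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    currentTag = null;
}

@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {
    if ("author".equals(currentTag)) {
        System.out.println(new String(ch, start, length));
    }
}

static public void main(String[] args) throws Exception {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
    spf.setNamespaceAware(true);
    spf.setValidating(true); // From what I understood from the API this combined
                             // with '<!DOCTYPE authors SYSTEM "mydtd.dtd">' from
                             // the file myxml.xml should do the trick. What do I miss?

    SAXParser saxParser = spf.newSAXParser();
    XMLReader xmlReader = saxParser.getXMLReader();
    xmlReader.setContentHandler(new MySaxParser());
    xmlReader.setErrorHandler(new MyErrorHandler(System.err));

    xmlReader.parse("file:/path/to/myxml.xml");
}
}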

What am I missing? Do I have to do more than spf.setValidating(true) to make the parser aware of the DTD referenced in the XML file header?

I should mention that the DTD and XML are syntactically and semantically correct. The DTD contains <!ENTITY aacute "&#225;" ><!-- small a, acute accent --> as a mapping for resolving. I downloaded the files from a trusted source, so the error has to be in my code.

Update:

As @eckes suggested, I printed the int values of the characters as they are passed into the method characters via

@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {
    if ("author".equals(currentTag)) {
        for (int i = start; i < length; i++) {
            System.out.println(ch[i] + " - " + Character.getNumericValue(ch[i]));
        }
    }
}

The console output was:

B - 11
a - 10
l - 21
? - -1
z - 35
s - 28

The -1 indicates that something went wrong before the event characters was even fired, doesn't it?

My ErrorHandler:

package com.hw;

import java.io.PrintStream;

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class MyErrorHandler implements ErrorHandler {
    private PrintStream out;

    MyErrorHandler(PrintStream out) {
        this.out = out;
    }

    private String getParseExceptionInfo(SAXParseException spe) {
        String systemId = spe.getSystemId();

        if (systemId == null) {
            systemId = "null";
        }

        String info = "URI=" + systemId + " Line=" + spe.getLineNumber() + ": "
                + spe.getMessage();

        return info;
    }

    public void warning(SAXParseException spe) throws SAXException {
        out.println("Warning: " + getParseExceptionInfo(spe));
    }

    public void error(SAXParseException spe) throws SAXException {
        String message = "Error: " + getParseExceptionInfo(spe);
        throw new SAXException(message);
    }

    public void fatalError(SAXParseException spe) throws SAXException {
        String message = "Fatal Error: " + getParseExceptionInfo(spe);
        throw new SAXException(message);
    }

}
Aufwind
  • The element name (here `dtd`) in the ` – Ian Roberts Aug 10 '14 at 20:05
  • @IanRoberts, yes it is, I typed it wrong, will fix that immediately. – Aufwind Aug 10 '14 at 20:06
  • Are you sure that it is not an encoding problem? – Hannes Aug 10 '14 at 20:11
  • @Hannes, the encoding is defined in the header of the xml file for the parser to derive from: `<?xml version="1.0" encoding="ISO-8859-1"?>`. Do you mean the encoding of my sysout could be something other than "ISO-8859-1"? – Aufwind Aug 10 '14 at 20:14
  • I think @Hannes is correct; you should check the sysout encoding one more time. – prashant thakre Aug 10 '14 at 20:17
  • "?" looks like the replacement char for encoders, so I think the entity was actually resolved but could not be represented in your encoding. Why don't you print out the int value for the char instead of the string? – eckes Aug 10 '14 at 20:43
  • @eckes, I added the mapping from char to int as suggested. For the character in question -1 is returned. – Aufwind Aug 10 '14 at 21:07
  • @Aufwind yes it looks like the encoder of the Reader is the problem. However I am not exactly sure why, the encoding should not apply to entities. Is the error handler not printing anything? Did you try to add the "á" directly into the test file? – eckes Aug 10 '14 at 21:15
  • @eckes, putting "á" directly into the xml file yields `B - 11, a - 10, l - 21, ? - -1`. As you can see, the line with the question mark is the last one printed, in contrast to the output I added above in my question. An exception is not thrown. I posted my error handler above as well, to make it possible to uncover errors there, too. – Aufwind Aug 10 '14 at 21:27
  • @Aufwind hm strange. It seems to deliver different callbacks in that case? – eckes Aug 10 '14 at 21:45
  • @eckes, yes, it simply stops after the character in question if I add `á` directly to the file - but without throwing an exception. Currently I am trying to make sense of the answers to this question: http://stackoverflow.com/questions/3482494/howto-let-the-sax-parser-determine-the-encoding-from-the-xml-declaration But otherwise I am totally out of ideas. :-) – Aufwind Aug 10 '14 at 21:51
  • which SAX parser are you using? – forty-two Aug 10 '14 at 22:04
  • @forty-two, I am using `import javax.xml.parsers.SAXParser; import javax.xml.parsers.SAXParserFactory;` – Aufwind Aug 10 '14 at 22:05
  • @Aufwind Yes, I can see that, but what implementation do you pick up. Print the full name of the actual SAXParser instance. – forty-two Aug 10 '14 at 22:08
  • @forty-two, sorry, I should have known you are asking for the implementation. Here is what a sysout prints for my sax parser instance: `com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl@2382600f` – Aufwind Aug 10 '14 at 22:10

1 Answer


You most certainly have a problem with the output encoding, i.e. the console (or whatever is receiving your output) cannot represent the character once it has been converted from UTF-16, Java's internal string encoding, to the charset used for output.
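One way to check that the parser itself resolves the entity is to take the console out of the picture and look at code points. The following is a minimal sketch, not the asker's setup: `EntityDemo`, `parseText` and `toHex` are hypothetical names, and the entity is declared in an internal DTD subset rather than in the external `mydtd.dtd`, so that the example is self-contained. It also collects the text in a `StringBuilder`, because a SAX parser is allowed to deliver one text node in several `characters` calls, which would explain why `Bal` and `?zs` appeared on separate lines.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the question's code.
class EntityDemo {

    // Parses the XML string and returns all character data, concatenated.
    static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // One text node may arrive in several chunks, so accumulate
                // instead of printing each chunk on its own line.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    // Renders every UTF-16 code unit as hex, separated by spaces.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: internal subset with the same entity as mydtd.dtd.
        String xml = "<!DOCTYPE authors [<!ENTITY aacute \"&#225;\">]>"
                + "<authors><author>Bal&aacute;zs</author></authors>";
        System.out.println(toHex(parseText(xml)));
    }
}
```

If the fourth value printed is `e1`, the entity was resolved to U+00E1 and the `?` is introduced later, when the text is written to the console.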

You are also being tricked by the Character#getNumericValue() method into thinking that you have an input or parser encoding problem. getNumericValue() tries to interpret the character as something representing a number; it does not return the code point. As the documentation states, if you give it the Roman numeral fifty, Ⅼ (U+216C), the method returns 50. For á it returns -1 because that character has no numeric value, not because something went wrong.

Try replacing the line:

System.out.println(ch[i] + " - " + Character.getNumericValue(ch[i]));

with

System.out.println(ch[i] + " - " + Integer.toHexString((int) ch[i]));

and you'll probably see that it prints

? - e1
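To make the difference concrete, here is a small sketch that prints both values side by side. `NumericValueDemo`, `describe` and `hexDump` are hypothetical names used only for illustration. `hexDump` also loops up to `start + length`: the `characters` callback receives a slice of a larger buffer, so a loop bounded by `length` alone can stop early whenever `start` is not zero.

```java
// Hypothetical demo class, not part of the question's code.
class NumericValueDemo {

    // Returns "<numeric value> / <code point in hex>" for one char.
    static String describe(char c) {
        return Character.getNumericValue(c) + " / " + Integer.toHexString(c);
    }

    // Hex-dumps the slice ch[start .. start + length - 1], which is the
    // range a SAX characters(ch, start, length) callback refers to.
    static String hexDump(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < start + length; i++) {
            if (i > start) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(ch[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'B', a-acute (U+00E1) and the Roman numeral fifty (U+216C)
        char[] samples = {'B', '\u00e1', '\u216c'};
        for (char c : samples) {
            System.out.println(describe(c));
        }
        // Slice that skips the first two chars of the buffer
        System.out.println(hexDump("xxBal\u00e1zs".toCharArray(), 2, 6));
    }
}
```

The letter `B` gets numeric value 11 because Latin letters are treated as digits in higher radixes, which is why the original output looked plausible for every character except á.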

Now, how to fix the output encoding problem: I cannot help you there unless you give us more details.

Update

You can set the eclipse console encoding in

Run Configurations --> Common

or in the JDK/JRE using the

-Dfile.encoding

property (not 100% sure on this one).
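As a rough illustration of where the `?` comes from and of one possible workaround in code, the sketch below encodes the name to US-ASCII (the default encoding reported in the comments) and then writes it through a `PrintStream` with an explicit charset. `ConsoleEncodingDemo` and `asciiRoundTrip` are hypothetical names, and UTF-8 is only an assumption about what the console can display; the console itself still has to be configured to match.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the question's code.
class ConsoleEncodingDemo {

    // Encodes to US-ASCII and decodes again, mimicking what happens when
    // System.out uses an ASCII default encoding.
    static String asciiRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String name = "Bal\u00e1zs";

        // Characters that US-ASCII cannot encode are replaced by '?'.
        System.out.println(asciiRoundTrip(name));

        // Assumption: the receiving console expects UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(name);
    }
}
```

Wrapping `System.out` this way only changes the bytes Java emits. If the console decodes them with a different charset, the output will still look wrong, so the Eclipse console setting above remains the first thing to check.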

forty-two
  • You were right, I made the suggested change and it yielded what you expected. What do you need to know to help me fix the output encoding problem? Here is a first bunch of information that I think might be relevant. I assume the `char` array `ch` contains characters encoded in `"ISO-8859-1"`, right? I am currently on a Mac using Eclipse Indigo. A `System.out.println("Default Encoding = " + System.getProperty("file.encoding"));` yielded `US-ASCII`. – Aufwind Aug 10 '14 at 22:32
  • char[] does not have different encodings (it is always UTF16). – eckes Aug 10 '14 at 23:11