I need to handle the case where a user submits an invalid XML file, and report back to them information about the error: ideally its location (line number and column number) and its nature.
My sample code (see below) works well enough when there is a missing tag or a similar error. In that case, I get an approximate location and a useful explanation. However, my code fails spectacularly when the XML file contains bytes that are not valid UTF-8. In this case, I get a useless error:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
I cannot find a way to determine the line number where the invalid character might be, nor the character itself. Is there a way to do this?
If, as one comment suggests, it may not be possible as we don't get to the parsing step, is there a way to process the XML file, not with a parser, but simply line-by-line, looking for and reporting non-UTF-8 characters?
Sample code follows. First a basic error handler:
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XmlErrorHandler implements ErrorHandler {
    @Override
    public void warning(SAXParseException e) throws SAXException {
        show("Warning", e);
        throw e;
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        show("Error", e);
        throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        show("Fatal", e);
        throw e;
    }

    // Print the location the parser reports, followed by its message.
    private void show(String type, SAXParseException e) {
        System.out.println("Line " + e.getLineNumber() + " Column " + e.getColumnNumber());
        System.out.println(type + ": " + e.getMessage());
    }
}
And a trivial test program:
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTest {
    public static void main(String[] args) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser parser = spf.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(new XmlErrorHandler());
            InputSource is = new InputSource(args[0]);
            reader.parse(is);
        }
        catch (SAXException e) { // Useful error case
            System.err.println(e);
            e.printStackTrace(System.err);
        }
        catch (Exception e) { // Useless error case arrives here
            System.err.println(e);
            e.printStackTrace();
        }
    }
}
Sample XML file (with non-UTF-8 smart quotes from, say, a Word document; <91> and <92> stand for the raw bytes 0x91 and 0x92):
<?xml version="1.0" encoding="UTF-8"?>
<example>
<![CDATA[Text with <91>smart quotes<92>.]]>
</example>
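For the fallback asked about above (checking the file without a parser), one possible approach is to run the raw bytes through a strict UTF-8 decoder and count lines and columns along the way. The following is only a rough sketch under some assumptions: the class name `Utf8Scanner` and the method `findInvalidUtf8` are made up for illustration, the whole file is assumed to fit in memory, the file is assumed to be UTF-8 regardless of what its XML declaration says, and columns are counted in Java `char` units (so a character outside the Basic Multilingual Plane counts as two).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of SAX or any XML API.
class Utf8Scanner {

    /** Returns one message per malformed UTF-8 byte sequence found in data. */
    static List<String> findInvalidUtf8(byte[] data) {
        List<String> problems = new ArrayList<>();
        // REPORT makes the decoder stop at bad input instead of replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        int line = 1;
        int column = 1;

        while (true) {
            CoderResult result = decoder.decode(in, out, true);

            // Advance line/column over everything decoded successfully so far.
            out.flip();
            while (out.hasRemaining()) {
                if (out.get() == '\n') {
                    line++;
                    column = 1;
                } else {
                    column++;
                }
            }
            out.clear();

            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                // The malformed bytes start at the buffer's current position.
                int bad = in.get(in.position()) & 0xFF;
                problems.add(String.format(
                        "Line %d Column %d: invalid UTF-8 byte 0x%02X", line, column, bad));
                // Skip the bad sequence and treat it as one column.
                in.position(in.position() + result.length());
                column++;
            }
            // On overflow the output buffer was simply full; loop again.
        }
        return problems;
    }
}
```

It could be called as `Utf8Scanner.findInvalidUtf8(Files.readAllBytes(Paths.get(args[0])))` before handing the file to the parser. For the sample file above, assuming line 3 has no leading whitespace, this should report byte 0x91 at line 3, column 20 and byte 0x92 at line 3, column 33.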