I need to support the situation where a user submits an invalid XML file and I report back information about the error. Ideally, the report would include the location of the error (line number and column number) and the nature of the error.

My sample code (see below) works well enough when there is a missing tag or similar error. In that case, I get an approximate location and a useful explanation. However, my code fails spectacularly when the XML file contains bytes that are not valid UTF-8. In this case, I get a useless error:

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.

I cannot find a way to determine the line number where the invalid character might be, nor the character itself. Is there a way to do this?

If, as one comment suggests, it may not be possible as we don't get to the parsing step, is there a way to process the XML file, not with a parser, but simply line-by-line, looking for and reporting non-UTF-8 characters?

Sample code follows. First a basic error handler:

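import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class XmlErrorHandler implements ErrorHandler {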
    @Override
    public void warning(SAXParseException e) throws SAXException {
        show("Warning", e); throw e;
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        show("Error", e); throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        show("Fatal", e); throw e;
    }

    private void show(String type, SAXParseException e) {
        System.out.println("Line " + e.getLineNumber() + " Column " + e.getColumnNumber());
        System.out.println(type + ": " + e.getMessage());
    }
}

And a trivial test program:

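import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XmlTest {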
    public static void main(String[] args) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser parser = spf.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.setErrorHandler(new XmlErrorHandler());
            InputSource is = new InputSource(args[0]);
            reader.parse(is);
        }
        catch (SAXException e) {      // Useful error case
            System.err.println(e);
            e.printStackTrace(System.err);
        }
        catch (Exception e) {         // Useless error case arrives here
            System.err.println(e);
            e.printStackTrace();
        }
    }
}

Sample XML file (with non-UTF-8 smart quotes from, say, a Word document; <91> and <92> stand for the raw bytes 0x91 and 0x92):

<?xml version="1.0" encoding="UTF-8"?>
<example>
    <![CDATA[Text with <91>smart quotes<92>.]]>
</example>
dave
  • Possible duplicate of [How to fix Invalid byte 1 of 1-byte UTF-8 sequence](https://stackoverflow.com/questions/15545720/how-to-fix-invalid-byte-1-of-1-byte-utf-8-sequence) – Martin Frank Feb 09 '18 at 07:53
  • The reason Xerces isn't giving you a line number is that the error comes from a level of the system where the bytes haven't yet been turned into lines of characters. Indeed, if the UTF-8 can't be decoded then identifying line endings may not be possible. – Michael Kay Feb 09 '18 at 08:04

1 Answer

I had some success locating the problem in the XML file using a couple of approaches.

Adapting the code from my question to use a home-grown ContentHandler with a Locator (see below) demonstrated that the XML is processed up until the invalid character is encountered. In particular, the line number is tracked. Preserving the line number allows it to be retrieved from the ContentHandler when the problematic exception occurs.

At this point, I came up with two possibilities. The first is to re-run the processing with a different encoding set on the InputSource, e.g. Windows-1252. Parsing completed without error in this instance, and I was able to retrieve the characters on the line with the known issue. This allows for a reasonably useful error message to the user, i.e. the line number and the characters.
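A lighter-weight variation on this first idea, shown here only as a sketch and not as the re-parse described above, is to decode the raw bytes as Windows-1252 and pull out the suspect line directly. The class name FallbackLineReader and the method lineAt are hypothetical, and the sketch assumes the file really is Windows-1252 and that the line number comes from the Locator:

```java
import java.nio.charset.Charset;

final class FallbackLineReader {

    /**
     * Decodes the whole input with the fallback charset and returns the
     * requested 1-based line, or null if the line number is out of range.
     */
    static String lineAt(byte[] data, int lineNumber, Charset fallback) {
        // new String(...) substitutes a replacement character for
        // undecodable bytes instead of throwing an exception.
        String text = new String(data, fallback);
        // Treat CRLF, CR and LF as line ends, as XML line-end handling does.
        String[] lines = text.split("\r\n|\r|\n", -1);
        if (lineNumber < 1 || lineNumber > lines.length) {
            return null;
        }
        return lines[lineNumber - 1];
    }
}
```

It could be called with the file's bytes, errorFinder.getCurrentLineNumber() and Charset.forName("windows-1252"). The Locator reports the position of the last event delivered before the failure, so the offending byte may sit on that line or a later one.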

My second approach was to adapt the code from the top-rated answer to this SO question. That code finds the first byte in a stream that is not valid UTF-8. If you assume that 0x0A (linefeed) marks a new line in the XML (a safe assumption for UTF-8, where 0x0A never appears inside a multi-byte sequence, and one that works pretty well in practice), then the line number, column number and the invalid bytes can be extracted easily enough for a precise error message.
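The byte-scanning idea can be sketched as follows. This is not the code from the linked answer; it swaps in the JDK's CharsetDecoder with CodingErrorAction.REPORT to find the first malformed byte, then counts 0x0A bytes to derive the line. The class name Utf8ErrorLocator and the method findFirstInvalid are hypothetical, and the whole file is assumed to fit in memory:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8ErrorLocator {

    /** Returns {line, column, byte value} of the first malformed byte, or null if the data is valid UTF-8. */
    static int[] findFirstInvalid(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never produces more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        if (!result.isError()) {
            return null;
        }
        // On a malformed-input result the buffer position is the start of the bad sequence.
        int offset = in.position();
        int line = 1;
        int lineStart = 0;
        for (int i = 0; i < offset; i++) {
            if (data[i] == 0x0A) { // linefeed starts a new line
                line++;
                lineStart = i + 1;
            }
        }
        // Bytes before the offset are valid UTF-8, so they can be decoded to count columns in chars.
        int column = new String(data, lineStart, offset - lineStart, StandardCharsets.UTF_8).length() + 1;
        return new int[] { line, column, data[offset] & 0xFF };
    }

    public static void main(String[] args) throws IOException {
        int[] hit = findFirstInvalid(Files.readAllBytes(Paths.get(args[0])));
        if (hit == null) {
            System.out.println("Input is valid UTF-8");
        } else {
            System.out.printf("Line %d Column %d: invalid byte 0x%02X%n", hit[0], hit[1], hit[2]);
        }
    }
}
```

The column is counted in Java chars from the start of the line, which may differ from what an editor shows if the line contains tabs or characters outside the Basic Multilingual Plane.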

// Modified test program
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class XmlTest {
    public static void main(String[] args) {
        ErrorFinder errorFinder = new ErrorFinder(0); // Create our own content handler
        InputSource is = new InputSource(args[0]);    // Declared outside the try so the catch block can use it
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            SAXParser parser = spf.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(errorFinder); // Use instead of the default handler
            reader.setErrorHandler(new XmlErrorHandler());
            reader.parse(is);
        }
        catch (SAXException e) {      // Useful error case
            System.err.println(e);
            e.printStackTrace(System.err);
        }
        catch (Exception e) {         // Useless error case arrives here
            System.err.println(e);
            e.printStackTrace();
            // Option 1: repeat the parse (see above) with the encoding overridden and a
            // new ErrorFinder that captures the characters on the failing line:
            ErrorFinder ef2 = new ErrorFinder(errorFinder.getCurrentLineNumber());
            is.setEncoding("Windows-1252");
        }
    }
}

// Content handler with irrelevant method implementations elided.
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

public class ErrorFinder implements ContentHandler {
    private int lineNumber; // If non-zero, the line number to retrieve characters for.
    private int currentLineNumber;
    private char[] chars;
    private Locator locator;

    public ErrorFinder(int lineNumber) {
        super();
        this.lineNumber = lineNumber;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() throws SAXException {
        currentLineNumber = locator.getLineNumber();
    }

    ... // Other overridden methods are skipped, as they have the same body as startDocument().

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        currentLineNumber = locator.getLineNumber();
        if (currentLineNumber == lineNumber) {
            char[] c = new char[length];
            System.arraycopy(ch, start, c, 0, length);
            chars = c;
        }
    }

    public int getCurrentLineNumber() {
        return currentLineNumber;
    }

    public char[] getChars() {
        return chars;
    }
}
dave