I have an XML string of 8001 characters that I want to parse with SAXParser, but I get the exception below. If I remove or add just one character, everything works perfectly. The XML is loaded from a CLOB field in an Oracle DB.

The exception: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 39; Content ist nicht zulässig in Prolog. (German for "Content is not allowed in prolog.")

Can anyone explain why this happens?

    public static boolean isWellformed(final String xml) {
        if (xml == null) {
            return false;
        }

        SAXParser saxParser;
        DefaultHandler dh;
        try {
            final SAXParserFactory spfactory = SAXParserFactory.newInstance();
            saxParser = spfactory.newSAXParser();
            dh = new DefaultHandler();
        } catch (final Exception ex) {
            log.error("Cannot initialize SAX parser.", ex);
            return false;
        }

        ByteArrayInputStream bin = null;

        try {
            bin = new ByteArrayInputStream(xml.getBytes("UTF-8"));
            saxParser.parse(bin, dh);
        } catch (final SAXException se) {
            return false;
        } catch (final IOException ex) {
            return false;
        } finally {
            IOUtils.close(bin);
        }
        return true;
    }

The XML is generated and used by CKEditor. XML sample:

<?xml version="1.0" encoding="UTF-8"?><segment><chapter level="2" align=" center">Decisions</chapter><text>Text text  text .....</text></segment>
ninjaxelite
  • What does the XML file look like? – Dominique Oct 15 '18 at 12:31
  • Seems to be a duplicate of https://stackoverflow.com/questions/5138696/org-xml-sax-saxparseexception-content-is-not-allowed-in-prolog – Kostiantyn Oct 15 '18 at 12:32
  • @Dominique updated my question. – ninjaxelite Oct 15 '18 at 12:36
  • @Kostiantyn the symptoms might be the same, but the cause might be quite different. The problem with the "content not allowed in prolog" error message is that it has such a wide variety of possible causes. – Michael Kay Oct 15 '18 at 16:58
  • I suggest showing us a hex display of the initial bytes of the input stream that you are parsing. – Michael Kay Oct 15 '18 at 16:58
  • I'm afraid there's not enough information for us to pinpoint where the problem is. As Michael mentioned, printing the chars as hex bytes may help. Not the entire document, but the first and the last 72 chars should do it – nandsito Oct 15 '18 at 18:10
  • @MichaelKay Hex representation: 3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e3c7365676d656e743e3c63686170746572206c6576656c3d22322220616c69676e3d222063656e746572223e456e74736368656964756e67736772fc6e64653c2f636861707465723e3c746578743e3c7370616e207374796c653d22223e4d69742064656e204162676162656e6265736368656964656e20766f6d20362ea04e6f76656d626572a0323031332077757264656e20646572204265736368776572646566fc68726572696e2066fc7220646965204a616872... – ninjaxelite Oct 16 '18 at 07:35
  • @nandsito last hex chars: ...03420422d5647206772756e6473e4747a6c6963686520426564657574756e67207a756b6f6d6d742e2045696e65205265766973696f6e20697374206461686572206e69636874207a756ce4737369672e3c2f7370616e3e3c2f746578743e3c746578743e3c7370616e207374796c653d22223e3c7370616e207374796c653d22223e4175732064656e2064617267657374656c6c74656e20457277e467756e67656e207761722073707275636867656de4df207a7520656e74736368656964656e2e3c2f7370616e3e3c2f7370616e3e3c2f746578743e3c2f7365676d656e743e – ninjaxelite Oct 16 '18 at 07:36
  • Thanks. I'm afraid it doesn't reveal anything obviously wrong. But at least it eliminates some possible causes. – Michael Kay Oct 16 '18 at 09:26
  • If I convert the XML string to hex, I find one more space than in the original XML. How could I find that one? The original XML has 866 spaces and the hex has 867. – ninjaxelite Oct 16 '18 at 09:30
  • I'm not sure if it's what's causing the problem, but the document states it's encoded in UTF-8 yet contains ISO-8859-1 characters – nandsito Oct 16 '18 at 11:26
  • The XML document may be correct, but there may be an issue in the data pipeline that may be corrupting the text – nandsito Oct 16 '18 at 11:28
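
The checks suggested in the comments above (a hex display of the leading bytes, a test of whether the bytes really are UTF-8, and a search for the "extra" space) could be sketched roughly as follows. This is only a diagnostic sketch, not a fix: the class name XmlByteDiagnostics and its three method names are made up for illustration, and it assumes you can get at the exact byte array or string that is handed to the parser. It relies only on standard JDK classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names come from the question.
final class XmlByteDiagnostics {

    private XmlByteDiagnostics() {
    }

    /** Hex-encodes at most 'limit' leading bytes, two lowercase digits per byte. */
    static String hexPrefix(byte[] data, int limit) {
        int n = Math.min(limit, data.length);
        StringBuilder sb = new StringBuilder(n * 2);
        for (int i = 0; i < n; i++) {
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    /** Returns true only if the bytes decode as UTF-8 without a malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Returns the indices of characters that are Unicode space separators
     * but not the plain ASCII space, e.g. U+00A0 (no-break space).
     */
    static List<Integer> findNonAsciiSpaces(String text) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && Character.isSpaceChar(c)) {
                positions.add(i);
            }
        }
        return positions;
    }
}
```

For example, XmlByteDiagnostics.hexPrefix(bytes, 72) gives the kind of prefix dump requested above, and XmlByteDiagnostics.isValidUtf8(bytes) returns false for input containing a lone fc or a0 byte like those visible in the posted hex. Whether either finding explains the exception at column 39 is not something this sketch can establish.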

0 Answers