I'm currently facing a strange issue that occurs only occasionally. On startup, my application unmarshals an XML file with several million rows using StAX (XMLStreamReader) with JAXB and Java streams, and imports the resulting objects into a database if the XML has changed. So far this works correctly, except on some devices (approximately 5% of over 1000 devices). On these devices I get a javax.xml.stream.XMLStreamException. Sometimes a restart helps and the XML is then processed successfully. The XML has the same content on all devices, so both the XML and the XSD are valid.

The exception also does not always occur at the same place. E.g.:

Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[2650616,17] Message: Element type "XX" must be followed by either attribute specifications, ">" or "/>".

Later:

[javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3272359,14] Message: Element type "XY" must be followed by either attribute specifications, ">" or "/>".]

The whole application runs in a microservice architecture, but there are no dependencies on other services. A lot happens on startup, as each microservice initializes its own state. It seems to me that there might be a memory issue, as the problem is not reproducible and the microservices on the devices don't differ in their versions.

Before optimizing the unmarshalling process, I would like to be able to reproduce the issue first, to ensure that any improvements actually work. When I reduce Xmx and Xms, I may get an OutOfMemoryError but never an XMLStreamException.

Right now I'm asking myself:

  • When and why may an XMLStreamException occur, and how can I reproduce this behaviour?
  • Why does this happen only occasionally, when all devices should be the same?
  • Should I switch to SAX, which is more memory-efficient?

Thanks in advance for any help.

codeStyler
  • Maybe a hard disk failure. Or are you sure no other process writes to the file while you are reading it? For debugging, you can log the whole input stream with something like [Apache Commons IO TeeInputStream](https://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/TeeInputStream.html), so you can see exactly the bytes presented to your XML parser. – vanje Jul 27 '20 at 11:27
  • Once the file is deployed, it cannot be changed. This ensures that no other process modifies it in the meantime. Thanks for the hint, I will probably try debugging the input stream. – codeStyler Jul 27 '20 at 11:43
  • @vanje: No, it's almost certainly due to [data variation](https://stackoverflow.com/a/63116079/290085), not device failure, but you're right to recommend that the *exact* XML causing the error be logged to facilitate debugging. – kjhughes Jul 27 '20 at 13:14

1 Answer

There's not enough information in your question to allow a definitive answer, but we can help you home in on the problem.

  1. The variations you're seeing are almost certainly due to input variations, not device failures.

  2. The errors indicate that the stream is not well-formed XML. (The textual data is technically not even XML; it's causing a pre-validation parsing error.)

  3. Here is a simple example of not-well-formed XML that would generate such errors:

    <r a='''/>
    

    Notice that there's an unescaped ' within an attribute value. This can easily happen when code pulls data from a source, fails to escape it, and writes it into an attribute value. The variability would arise from data variability. For example, most names do not have ' in them, but O'Toole does.
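
To see the effect of such input for yourself, here is a minimal sketch that feeds the snippet above, plus a correctly escaped variant, to a StAX `XMLStreamReader`. The class name `NotWellFormedDemo` and the helper `firstParseError` are made up for illustration. With the JDK's built-in parser, the message for the broken input should resemble the one in the question, though the exact wording depends on which StAX implementation is on the classpath.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of the asker's application.
class NotWellFormedDemo {

    /**
     * Pulls every event from the document.
     * Returns null if the input is well-formed, otherwise the parser's message.
     */
    static String firstParseError(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return null;
        } catch (XMLStreamException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Apostrophe escaped as &apos;: well-formed, so no error is reported.
        System.out.println(firstParseError("<r a='O&apos;Toole'/>"));
        // Unescaped apostrophe inside a single-quoted attribute value: not well-formed.
        System.out.println(firstParseError("<r a='''/>"));
    }
}
```

This only demonstrates how data-dependent input produces this class of error; it does not show that this is what happens on the affected devices.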

Log the exact XML that's failing as a next step to debug the problem, as mentioned by @vanje in comments.
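
One way to capture the failing XML without extra libraries is to use the location carried by the exception and print the surrounding lines of the file. The sketch below is an assumption-laden illustration, not the asker's code: the class `FailingXmlLogger`, its method names, and the `context` parameter are hypothetical, and it parses with a bare `XMLStreamReader` rather than through JAXB. It re-reads the file as ISO-8859-1 so that corrupted bytes are still shown instead of causing a decoding error.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; names are illustrative only.
class FailingXmlLogger {

    /** Returns the lines from (lineNumber - context) to (lineNumber + context), 1-based. */
    static String linesAround(Path file, int lineNumber, int context) throws IOException {
        long first = Math.max(1, (long) lineNumber - context);
        long last = (long) lineNumber + context;
        // ISO-8859-1 maps every byte to a character, so damaged bytes cannot abort the read.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.skip(first - 1)
                        .limit(last - first + 1)
                        .collect(Collectors.joining(System.lineSeparator()));
        }
    }

    /** Returns null if the file parses cleanly, otherwise the message plus a file excerpt. */
    static String parseAndReport(Path file, int context) throws IOException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = Files.newInputStream(file)) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return null;
        } catch (XMLStreamException e) {
            Location loc = e.getLocation(); // may be null for some failures
            int line = (loc != null) ? loc.getLineNumber() : -1;
            String excerpt = (line >= 1) ? linesAround(file, line, context)
                                         : "(no location reported)";
            return e.getMessage() + System.lineSeparator() + excerpt;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(parseAndReport(Paths.get(args[0]), 2));
        }
    }
}
```

Note that the excerpt is read from disk after the failure. If it looks well-formed there, the bytes the parser received differed from the file contents, and teeing the input stream as @vanje suggested would be the more telling experiment.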

kjhughes