I have an XML file and its XSD schema. I can validate the XML file using a custom org.xml.sax.ErrorHandler like the following:

class MyErrorHandler implements ErrorHandler{
  ...
  @Override
  public void warning(SAXParseException exception) throws SAXException {
    System.out.println("Line " + exception.getLineNumber() + ": " + exception.getMessage());
    warnings++;
  }
...
}

Is it possible to actually manipulate the element causing the exception, for example by removing it from the XML file?

Two notes:

  • the XML manipulation doesn't need to be in-place, i.e. I can produce a new file with the failing elements removed;
  • best would be to be able to also get the parent element of the one causing the exception, so that I can decide whether to remove the parent altogether.

Even just a suggestion about which direction to take to solve the problem would be appreciated. Thanks!

kjhughes
Niccolò

2 Answers

4

Automatic repair of an XML document is not possible in the general case.

Only in very limited contexts would the repair needed to make an XML document valid be automatically discernible from a given validation error. There is no one-to-one mapping between validation errors and ways of remedying them.

Consider an element r with children a through e:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="r">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="a"/>
        <xsd:element name="b"/>
        <xsd:element name="c"/>
        <xsd:element name="d"/>
        <xsd:element name="e"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>

An XML document such as this one,

<r>
  <a/>
  <x/>
  <b/>
  <c/>
  <d/>
  <e/>
</r>

would yield a validation message such as the following by Xerces-J:

[Error] try.xml:5:7: cvc-complex-type.2.4.a: Invalid content was found starting with element 'x'. One of '{b}' is expected.

You might here automatically remove x, and all would be fine. (Or, you might insert a b, which would not be fine.)

However, for the same XSD, consider that this XML document,

<r>
  <a/>
  <c/>
  <d/>
  <e/>
</r>

would yield a validation message such as the following by Xerces-J:

[Error] try.xml:5:7: cvc-complex-type.2.4.a: Invalid content was found starting with element 'c'. One of '{b}' is expected.

If you automatically removed c, your document would still be invalid, and you'd receive a similar message about d being unexpected. This would continue until your document looked like this,

<r>
  <a/>
</r>

at which point the error message would once again point to the original problem, the missing b:

[Error] try.xml:5:5: cvc-complex-type.2.4.b: The content of element 'r' is not complete. One of '{b}' is expected.

As you can see, there's simply not enough information available in a given validation error to know how to repair the XML document in general.

You could do better by consulting the XSD, but this is extremely complex and still not guaranteed to uniquely determine the exact mistake made by the authoring person or system. Automatic repair of an XML document, even given an XSD, is not possible in the general case.
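The ambiguity can be reproduced with the JDK's built-in javax.xml.validation API. The following is only a minimal sketch: the class name AmbiguousErrorsDemo and the helper validate are made up for illustration, the schema is the one shown above inlined as a string, and the exact message wording depends on the validator implementation and locale.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; not part of the original question or answer.
class AmbiguousErrorsDemo {
    // Same content model as the schema above: r contains a, b, c, d, e in sequence.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='r'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='a'/><xsd:element name='b'/><xsd:element name='c'/>"
      + "<xsd:element name='d'/><xsd:element name='e'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element></xsd:schema>";

    // Validates the given XML text and returns the messages of all reported
    // warnings and errors instead of stopping at the first one.
    static List<String> validate(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        List<String> messages = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { messages.add(e.getMessage()); }
            @Override public void error(SAXParseException e) { messages.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        validator.validate(new StreamSource(new StringReader(xml)));
        return messages;
    }

    public static void main(String[] args) throws Exception {
        // Extra element x: removing the reported element would fix the document.
        System.out.println(validate("<r><a/><x/><b/><c/><d/><e/></r>"));
        // Missing element b: removing the reported element (c) would not.
        System.out.println(validate("<r><a/><c/><d/><e/></r>"));
    }
}
```

Both calls report that b was expected, even though the right repair is a deletion in the first case and an insertion in the second. The error alone does not say which.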

kjhughes
  • Thanks for your post on how to auto-repair generic XML files, with which I agree. Unfortunately, I believe you misunderstood my question, and I feel rather uncomfortable with the edits you made. Using the example in your answer, my scenario is more like this: given r1 out of a list of elements, if an error occurs while validating its children (a, b..), remove r1. The manipulation in my scenario is most likely a cut and paste from the original XML to a logged one, which is then processed as per the business logic (sent back to the poster? processed by a human or an AI?). – Niccolò Dec 06 '16 at 13:19
  • As you see, my scope is far more limited than building a tool to “automatically repair” any given XML provided with its schema. That I consider an interesting research topic, maybe for AI, and having posted it here would imply that I don’t understand its complexity. Hence my discomfort at having your edited question associated with my username. Your edits also prevent me from receiving what I am actually looking for: a way to manipulate XML files with actions triggered by the validation implemented in the javax package! I’d now like to know what to do with my question and your answer. – Niccolò Dec 06 '16 at 13:19
  • @Niccolò: I am sorry you were unsatisfied with my edits of your question; I have rolled-back your question to its original form. By way of explanation, however, I will say that the essence of your actual *problem* (if not your perception) is still what we've answered: Auto-repair of invalid XML is not possible in general. The new scenario described in your comment does nothing to change this: You've merely renamed the culprit to `r1`, and all the same problem attribution challenges apply to it. – kjhughes Dec 06 '16 at 13:43
  • @Niccolò: If you're completely not asking about how to restore validity in the face of validation errors, then I request that you do this: Allow this question to stand as it is about validity repair (and perhaps allow my edits to help future readers find answers to this topic) and ask a completely new question that more clearly describes what you want to do that doesn't depend upon validity repair. I'll see if I can help there too, and you'll have a fresh audience that isn't distracted by the validity repair problem. Sound good? – kjhughes Dec 06 '16 at 13:48
0

Everything kjhughes says is correct.

However, if there are particular patterns of validation errors in your input, then it's possible to create rules that fix those.

In many cases it's probably simplest to do this by writing XSLT code that detects the incorrect pattern and fixes it without even applying schema validation. For example, if you have a perennial problem with EEE elements where the child XXX element is supposed to precede child YYY but they are often in the wrong order, then you can repair that with a template rule

<xsl:template match="EEE[XXX >> YYY]">
  <xsl:copy>
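    <!-- Assumes a single XXX and a single YYY child; "except" avoids copying XXX twice. -->
    <xsl:copy-of select="YYY/preceding-sibling::*, XXX, YYY, YYY/following-sibling::* except XXX"/>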
  </xsl:copy>
</xsl:template>

In XML Schema theory, validating a document yields not just a yes/no answer, nor even a set of error messages, but a document (the post-schema-validation infoset, or PSVI) in which individual nodes are marked as valid or invalid and, if invalid, annotated with the error conditions that make them so. You can then explore this document, find the invalidities, and handle them appropriately. However, I don't think many tools implement this, at least not in full.

Recent releases of Saxon's schema processor introduce the InvalidityHandler interface, which is called with complete information about each validity error, along with an implementation of this interface that produces a report of validation errors in XML format. This is designed to enable tools that do more with the error information than simply put it in front of the user to ponder. There is certainly a class of validation errors for which you could take the error report and generate XSLT code to correct the error. For example, if the input is a set of transactions to be processed, you could create a transaction file that omits the transactions that failed validation.

(Having said that, for this particular use case it might be better to write an XSLT or XQuery application that validates transactions one by one, and uses try/catch to copy only the valid transactions.)
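For readers who want to stay within the JDK's javax.xml.validation API, the same one-by-one idea can be sketched in plain Java rather than XSLT or XQuery: parse the input into a DOM, validate each child of the root separately, and remove those that fail. Everything specific here is an assumption made for illustration: the batch/tx/id/amount vocabulary, the inline schema, and the names InvalidChildFilter, dropInvalidChildren, and isValid. Each child must have a global element declaration in the schema to be validated on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; names and schema are illustrative only.
class InvalidChildFilter {
    // Assumed schema: each tx must contain an integer id followed by a decimal amount.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='tx'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='id' type='xsd:int'/>"
      + "<xsd:element name='amount' type='xsd:decimal'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element></xsd:schema>";

    // Parses the XML, validates each child element of the root on its own,
    // and removes the children that fail. Returns the modified DOM document.
    static Document dropInvalidChildren(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // schema validation of a DOM needs namespace-aware parsing
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));

        Node child = doc.getDocumentElement().getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before any removal
            if (child instanceof Element && !isValid(schema, (Element) child)) {
                child.getParentNode().removeChild(child);
            }
            child = next;
        }
        return doc;
    }

    // With no ErrorHandler set, the validator throws on the first validity error.
    static boolean isValid(Schema schema, Element element) throws IOException {
        try {
            schema.newValidator().validate(new DOMSource(element));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

Because the whole element is in hand when validation fails, the decision about what to do with it (drop it, log it, move it to another file) is made on the DOM node rather than inside the ErrorHandler. The resulting document can be written out with a javax.xml.transform.Transformer.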

Michael Kay