6

I'm trying to validate a very large XML file (~200MB) against an XSD. It's taking almost 3 hours. What am I doing wrong here?

    SchemaFactory sf = SchemaFactory.newInstance(W3C_XML_SCHEMA_NS_URI);
    Schema schema = sf.newSchema(new File(this.productExtraInfoXsd));

    DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
    domFactory.setNamespaceAware(true);
    DocumentBuilder builder = domFactory.newDocumentBuilder();
    Document doc = builder.parse(new File(filePath));

    DOMSource domSource = new DOMSource(doc);
    DOMResult result = new DOMResult();

    Validator validator = schema.newValidator();
    validator.validate(domSource, result);
toy

2 Answers

3

Check this article on XML unmarshalling by Marco Tedone. Based on it, you can validate with StAX instead of building a DOM:

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(fileInputStream);
    Validator validator = schema.newValidator();
    validator.validate(new StAXSource(xmlStreamReader));
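For reference, here is a self-contained sketch of the same idea. The class name `StaxValidationSketch` and the helper `isValid` are hypothetical names chosen for illustration, and it assumes the JDK's built-in validator, which accepts a `StAXSource`. It passes no `Result`, so nothing is accumulated in memory while the document streams through.

```java
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Source;
import javax.xml.transform.stax.StAXSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class, not taken from the linked article.
final class StaxValidationSketch {

    /** Returns true if the XML read from xmlStream is valid against the given XSD. */
    static boolean isValid(Source xsd, InputStream xmlStream) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(xsd);

        // The reader pulls events from the stream one at a time.
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xmlStream);
        try {
            Validator validator = schema.newValidator();
            // No DOMResult here: the validator only checks, it builds no tree.
            validator.validate(new StAXSource(reader));
            return true;
        } catch (SAXException e) {
            // With no ErrorHandler set, the first validation error is thrown.
            return false;
        } finally {
            reader.close();
        }
    }
}
```

In real use, `xsd` would typically be a `StreamSource` over the schema file and `xmlStream` a buffered `FileInputStream` over the large document.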
constantlearner
  • According to the article, using StAX is slower than the "Pure JAXB unmarshalling" version; StAX just saves memory. So I think that if @toy wants to speed up the validation process, this is not a good solution ... or am I missing something? – Paolo Nov 04 '13 at 16:52
  • 1
    @Paolo - I don't believe constantlearner meant `unmarshalling` in the JAXB sense. This is simply suggesting using a `StAXSource` instead of a `DOMSource`. – bdoughan Nov 04 '13 at 16:56
  • 1
    @BlaiseDoughan If so, then OK. I have actually just found an old Stack Overflow topic that reports exactly what you are saying ... – Paolo Nov 04 '13 at 17:02
3

Have a look at this Stack Overflow topic, which says:

You should not use the DOMParser to validate a document (unless your goal is to create a document object model anyway). This will start creating DOM objects as it parses the document - wasteful if you aren't going to use them.

Maybe it will be useful!
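As a rough sketch of what the quoted advice implies, the validator can be handed a `StreamSource` so that it parses the file itself and no DOM is ever built. The class name `StreamValidationSketch` and the method `validate` are hypothetical names used only for illustration; the `javax.xml.validation` calls are the same ones the question already uses.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper class showing validation without a DocumentBuilder.
final class StreamValidationSketch {

    /** Validates xmlFile against xsdFile; throws SAXException on the first violation. */
    static void validate(File xsdFile, File xmlFile) throws SAXException, IOException {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(xsdFile);

        Validator validator = schema.newValidator();
        // StreamSource lets the validator read the file directly;
        // no Document and no DOMResult are created.
        validator.validate(new StreamSource(xmlFile));
    }
}
```

Compared with the code in the question, this drops the `DocumentBuilder`, the `DOMSource`, and the `DOMResult`. Whether that alone explains the 3 hours is not certain, since schema content can also matter.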

Paolo
  • 1
    +1. 3 hours does seem excessive for a 200MB validation, though it depends a bit on what's in the schema (e.g. expensive regular expressions). But the first step is to see how long it takes if you don't build a DOM at the same time. – Michael Kay Nov 05 '13 at 08:27