
My XML file looks like this:

<Customers>
    <Customer>
        <name>foo</name>
        <age>18</age>
        <sexe>Male</sexe>
    </Customer>

    <Customer>
        <name>foo1</name>
        <age>25</age>
        <sexe>Female</sexe>
    </Customer>
        .
        .
        .
</Customers>

It's a huge XML file (over a hundred thousand customers) that I need to unmarshal and then load into my database. It's a monthly task.

I need to validate each customer. If the customer is correct, it should be saved to the database; if any of the customer's data is incorrect, the error should be logged and that customer skipped.

I was thinking about writing my validation rules into the XSD and then, during unmarshalling, using the ValidationEventHandler to ignore the whole customer.

Does anyone have an idea how I can do that? Or is there another solution?

I've been searching the web for hours and haven't found an answer.

marc_s
vacolane
  • Can you show the JAXB code you have tried so far? Have you modeled the customer XML schema using a POJO with JAXB annotations yet? I would imagine you can just attempt to parse each customer with JAXB in a try/catch block, and if the schema doesn't match, some sort of exception would be thrown. – Andy Guibert Oct 04 '18 at 02:37
  • I've generated my POJO using Eclipse. It works fine, but I haven't implemented any checks yet. Before coding with something new I read the documentation first. It's my first time using XML. Today I'll try to do what you said and come back here whether I succeed or not. – vacolane Oct 04 '18 at 05:36

1 Answer


Variant 1 - The XML approach

The common XML processing approach is to separate validation from parsing. The validation step is usually done with the aid of an XML Schema (XSD).

  1. XML validation is always applied at the document level. Splitting the document up beforehand (e.g. one DOM per record) may help in your case. See: [How to split an XML file into multiple XML files using Java](https://stackoverflow.com/questions/29166170/how-to-split-an-xml-file-into-multiple-xml-files-using-java)
  2. Validate. You can use a tool like Trang to create a basic XSD.
  3. Sort out the problematic entries from the source document (a manual task; blame the data provider?)
  4. Deserialize only the good ones
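
The split-then-validate steps above could be sketched roughly as follows, using only the StAX and `javax.xml.validation` APIs that ship with the JDK. This is an illustration under assumptions, not a definitive implementation: the class name `PerRecordValidation`, the method names, and the tiny `CUSTOMER_XSD` schema are all hypothetical, and the sketch assumes the document as a whole is well-formed (a malformed record would abort the stream). Each `<Customer>` element is copied into its own small document and validated on its own, so one bad record no longer fails the whole file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class PerRecordValidation {

    // Hypothetical schema describing ONE customer; the real rules would go here.
    static final String CUSTOMER_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='Customer'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string'/>"
      + "<xs:element name='age' type='xs:nonNegativeInteger'/>"
      + "<xs:element name='sexe' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns the XML text of every customer that passes schema validation. */
    static List<String> validCustomers(String xml) {
        List<String> valid = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(CUSTOMER_XSD)));
            XMLEventReader events =
                    XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()
                        && "Customer".equals(event.asStartElement().getName().getLocalPart())) {
                    String record = copyElement(event, events);
                    try {
                        // Without a custom ErrorHandler, a validation error throws SAXException.
                        schema.newValidator().validate(new StreamSource(new StringReader(record)));
                        valid.add(record);
                    } catch (SAXException e) {
                        System.err.println("Skipping customer: " + e.getMessage());
                    }
                }
            }
        } catch (XMLStreamException | SAXException | IOException e) {
            throw new IllegalStateException("Could not process document", e);
        }
        return valid;
    }

    /** Copies one element (start tag through matching end tag) into a string. */
    static String copyElement(XMLEvent start, XMLEventReader events) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int depth = 0;
        XMLEvent current = start;
        while (true) {
            writer.add(current);
            if (current.isStartElement()) {
                depth++;
            } else if (current.isEndElement() && --depth == 0) {
                break;
            }
            current = events.nextEvent();
        }
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<Customers>"
            + "<Customer><name>foo</name><age>18</age><sexe>Male</sexe></Customer>"
            + "<Customer><name>foo1</name><age>abc</age><sexe>Female</sexe></Customer>"
            + "</Customers>";
        validCustomers(xml).forEach(System.out::println);
    }
}
```

Each string returned by `validCustomers` is a standalone `<Customer>` document, so it could then be handed to whatever unmarshaller is already in use. For a real file, the `StringReader` would be replaced by a stream over the file so the whole document never has to sit in memory.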

Variant 2 - The pure Java approach

It is also possible to use a library like Jackson (FasterXML) to create a rather lax mapping of the XML data to Java classes. See this example of how to read only certain properties for each entry in a list.

  1. Lax deserialization of all data into standard POJOs
  2. Validate each POJO in an additional Java post-processing step
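
A minimal sketch of these two steps is shown below. To keep it dependency-free it swaps Jackson for the JDK's own StAX reader, so it is not the Jackson mapping the answer refers to, only the same idea: read every field as a plain string without rejecting anything, then apply the rules in ordinary Java. The class `LaxCustomerImport`, its methods, and the concrete rules (age between 0 and 150, `sexe` being `Male` or `Female`) are assumptions made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class LaxCustomerImport {

    /** Lax POJO: every field is a String, so reading never fails on bad values. */
    static class Customer {
        String name;
        String age;
        String sexe;
    }

    /** Step 1: read all customers without judging their content. */
    static List<Customer> readAll(String xml) {
        List<Customer> all = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            Customer current = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String tag = reader.getLocalName();
                    if (tag.equals("Customer")) {
                        current = new Customer();
                    } else if (current != null) {
                        switch (tag) {
                            case "name": current.name = reader.getElementText(); break;
                            case "age":  current.age = reader.getElementText(); break;
                            case "sexe": current.sexe = reader.getElementText(); break;
                            default: break; // unknown elements are ignored
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("Customer")) {
                    all.add(current);
                    current = null;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Document is not well-formed", e);
        }
        return all;
    }

    /** Step 2: hypothetical business rules; returns an empty list if the customer is fine. */
    static List<String> validate(Customer c) {
        List<String> errors = new ArrayList<>();
        if (c.name == null || c.name.trim().isEmpty()) {
            errors.add("name is missing");
        }
        try {
            int age = Integer.parseInt(c.age == null ? "" : c.age.trim());
            if (age < 0 || age > 150) {
                errors.add("age out of range: " + age);
            }
        } catch (NumberFormatException e) {
            errors.add("age is not a number: " + c.age);
        }
        if (!"Male".equals(c.sexe) && !"Female".equals(c.sexe)) {
            errors.add("unexpected sexe: " + c.sexe);
        }
        return errors;
    }

    /** Logs and skips invalid customers, returns the rest. */
    static List<Customer> importValid(String xml) {
        List<Customer> valid = new ArrayList<>();
        for (Customer c : readAll(xml)) {
            List<String> errors = validate(c);
            if (errors.isEmpty()) {
                valid.add(c);
            } else {
                System.err.println("Skipping customer '" + c.name + "': " + errors);
            }
        }
        return valid;
    }
}
```

Keeping the rules in a separate `validate` method, rather than in the reader, means a rejected customer can be logged with all of its problems at once instead of only the first one.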

Variant 3 - Something different

Find the byte offsets of each customer and read each customer into a well-prepared POJO. Log exceptions and continue with the next one. The complete approach is described here.

  1. Create list of byte offsets
  2. Strict deserialization to your POJO
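
One possible reading of this variant is sketched below; it is not the linked approach itself, and all names in it (`OffsetImport`, `findOffsets`, `importAll`) are hypothetical. It assumes records are delimited by literal `<Customer>` and `</Customer>` tags without attributes or nesting, scans the raw bytes for those tags, and then parses each byte range as its own tiny document. Because every record is parsed in isolation, even a record that is not well-formed only costs that one record.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class OffsetImport {

    static final byte[] OPEN = "<Customer>".getBytes(StandardCharsets.UTF_8);
    static final byte[] CLOSE = "</Customer>".getBytes(StandardCharsets.UTF_8);

    /** Strict POJO: age must really be a number. */
    static class Customer {
        String name;
        int age;
        String sexe;
    }

    /** Naive byte search; returns -1 if the pattern does not occur at or after 'from'. */
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = from; i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    /** Step 1: list of [start, end) byte ranges, one per customer record. */
    static List<int[]> findOffsets(byte[] data) {
        List<int[]> ranges = new ArrayList<>();
        int start = indexOf(data, OPEN, 0);
        while (start >= 0) {
            int close = indexOf(data, CLOSE, start);
            if (close < 0) {
                break; // unterminated last record
            }
            int end = close + CLOSE.length;
            ranges.add(new int[] {start, end});
            start = indexOf(data, OPEN, end);
        }
        return ranges;
    }

    /** Step 2: strict deserialization of one byte range; throws on any problem. */
    static Customer parse(byte[] data, int start, int end) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(data, start, end - start))
                .getDocumentElement();
        Customer c = new Customer();
        c.name = text(root, "name");
        c.age = Integer.parseInt(text(root, "age"));
        c.sexe = text(root, "sexe");
        return c;
    }

    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() != 1) {
            throw new IllegalArgumentException("expected exactly one <" + tag + ">");
        }
        return nodes.item(0).getTextContent().trim();
    }

    /** Logs the failing offset and continues with the next record. */
    static List<Customer> importAll(byte[] data) {
        List<Customer> imported = new ArrayList<>();
        for (int[] range : findOffsets(data)) {
            try {
                imported.add(parse(data, range[0], range[1]));
            } catch (Exception e) { // broad on purpose: any failure skips just this record
                System.err.println("Skipping customer at byte " + range[0] + ": " + e);
            }
        }
        return imported;
    }
}
```

The sketch keeps the whole file in a byte array for brevity. With a file of this size, the same offsets could instead be used to seek into the file and read one record at a time.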
jschnasse
  • I like the XML approach. How can I deserialize only the good ones? In my event handler I can only return true to continue or false to stop everything... – vacolane Oct 04 '18 at 09:32
  • By my understanding the ValidationEventHandler is used for the whole document, not per record. You have to modify/fix the document by hand (step 2) to separate the invalid entries from the rest. – jschnasse Oct 04 '18 at 09:53
  • Another option: [Split the DOM by record](https://stackoverflow.com/questions/29166170/how-to-split-an-xml-file-into-multiple-xml-files-using-java) – jschnasse Oct 04 '18 at 09:56
  • Thank you so much for your help. I finally chose the Java one. I put my checks inside the setters of my XML POJO and throw an exception if something is wrong; it works like a charm. You helped me a lot in making that choice. I'll still split the XML for performance optimisation. – vacolane Oct 05 '18 at 11:34
  • I chose the Java method because my XML file is invalid and I can't ask the company that gives me the XML to produce a proper one, so I can't do real XSD validation. – vacolane Oct 05 '18 at 11:55