
I've been using JAXB for a while now to parse xml that looks roughly like this:

<report>    <-- corresponds to a "wrapper" object that holds 
                some properties and two lists - a list of A's and list of B's
    <some tags with> general <info/>
    ...
    <A>   <-- corresponds to an "A" object with some properties
        <some tags with> info related to the <A> tag <bla/>
        ...
    </A>
    <B>   <-- corresponds to a "B" object with some properties
        <some tags with> info related to the <B> tag <bla/>
        ...
    </B>
</report>

The side responsible for marshalling the xml is terrible but is out of my control.
It often sends invalid xml chars and/or malformed xml.
I talked to the side responsible and got lots of errors fixed, but some they just can't seem to fix.
I want my parser to be as forgiving as possible of these errors, and when that's not possible, to get as much info as possible from the xml that contains them.
So if the xml contains 100 A's and one has a problem, I would still like to be able to keep the other 99.
These are my most common problems:

1. Some info tag inner value contains invalid chars
    <bla> invalid chars here, either control chars or just &>< </bla>
2. The root entity is missing a closing tag
    <report> ..... stuff here .... NO </report> at the end!
3. An inner entity (A/B) is missing its closing tag, or it's somehow malformed.
    <A> ...stuff here... <somethingMalformed_blabla_A/>
    OR
    <A> ...  Something malformed here...</A>

I hope I explained myself well.
I really want to get as much info as possible from these xml's, even when they have problems.
I guess I need to employ some strategy that uses stax/sax along with JAXB but I'm not sure how.
If one A out of 100 has an xml problem, I don't mind throwing away just that A.
It would be much better, though, if I could get an A object with as much data as could be parsed before the error.

samz

2 Answers


The philosophy of XML is that creators of XML are responsible for creating well-formed XML; recipients are not responsible for repairing bad XML on arrival. XML parsers are required to reject ill-formed XML. There are other "tidy" tools that may be able to convert bad XML into good XML, but depending on the nature of the flaws in the input, it's unpredictable how well they will work. If you're going to get the benefits of using XML for data interchange, it needs to be well-formed. Otherwise you might just as well use your own proprietary format.
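
To illustrate what the narrowest kind of "tidy" step might look like, here is a minimal Java sketch that only drops characters outside the XML 1.0 Char production before the text reaches a parser. The class and method names are made up for this example. It does nothing about an unescaped & or <, or about missing closing tags, and those are the flaws where repair becomes unpredictable.

```java
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Hypothetical helper: returns a copy of the input with every
    // disallowed code point (control chars, lone surrogates) removed.
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharFilter::isXml10Char)
             .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Running the raw text through `stripInvalidXmlChars` before parsing only removes the control-character failures; everything else still has to be well-formed.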

Michael Kay
  • Sadly, in real life, not everyone adheres to this philosophy. In my scenario the sending side tries to send valid xml (and not some special format), but is unsuccessful for many reasons (mostly their bad code). I have to deal with this somehow. I'm trying to use the built-in high-level java tools (jaxb) as much as possible without building my own parser (going "low level"). Any helpful comments/code would be welcome. – samz Jul 19 '12 at 06:57
  • Yes, real life is messy. Fortunately, I'm in the position where I can tell you what the right technical solution is, and leave you to deal with the problem that you don't have enough control of the total system to implement the right technical solution. – Michael Kay Jul 20 '12 at 10:13
  • I'm interested in the technical solution. I'm already using JAXB but in a pretty basic way. I don't know how to implement this solution nor have I found any useful info online. That's why I ended up here... – samz Jul 24 '12 at 14:19
  • @MichaelKay What do you think of the idea of repairing the malformed xml files? What are the most common errors in those malformed xml files? Could you share some thoughts/insights? – xwang Jul 01 '16 at 19:40
  • I already have shared my thoughts and insights. Don't put up with data that isn't XML. If someone gave you a laptop and the disk was broken, you would send it back, not try to repair it. Don't put up with shoddy quality. – Michael Kay Jul 02 '16 at 19:17

This answer really helped me:

JAXB - unmarshal XML exception

In my case, I'm parsing results from Sysinternals Autoruns tool with the XML switch (-x). Either because the results were being written to a file share or for some buggy reason in the newer version, the XML would be malformed near the end. Since this Autoruns capture is critical for malware investigations, I really wanted the data. Plus I could tell from the file size that the results were nearly complete.

The solution in the linked question works really well when you have a document with many sub-elements, as in the OP's case. In particular, the Autoruns XML output is really simple and consists of many "items", each consisting of many simple elements with text (i.e. String properties as generated by XJC). So if a few items are missed at the end, no big deal... unless of course it's something related to malware. :)

Here's my code:

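import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class Loader {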

    private List<Exception> exceptions = new ArrayList<>();

    public synchronized List<Exception> getExceptions() {
        return new ArrayList<>(exceptions);
    }

    protected void setExceptions(List<Exception> exceptions) {
        this.exceptions = exceptions;
    }

    public synchronized Autoruns load(File file, boolean attemptRecovery)
      throws LoaderException {
        Unmarshaller unmarshaller;
        try {
            JAXBContext context = JAXBContext.newInstance(Autoruns.class);
            unmarshaller = context.createUnmarshaller();
        } catch (JAXBException ex) {
            throw new LoaderException("Could not create unmarshaller.", ex);
        }
        try {
            return (Autoruns) unmarshaller.unmarshal(file);
        } catch (JAXBException ex) {
            if (!attemptRecovery) {
                throw new LoaderException(ex.getMessage(), ex);
            }
        }
        exceptions.clear();
        Autoruns autoruns = new Autoruns();
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // try-with-resources closes the file even if parsing fails part-way
        try (FileInputStream in = new FileInputStream(file)) {
            XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.peek();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    if (start.getName().getLocalPart().equals("item")) {
                        // the inner try lets processing continue with the
                        // items that follow if this one is malformed
                        try {
                            JAXBElement<Autoruns.Item> jaxb =
                              unmarshaller.unmarshal(eventReader,
                                                     Autoruns.Item.class);
                            autoruns.getItem().add(jaxb.getValue());
                            // the reader is already past this item's end tag;
                            // skip next() so an adjacent <item> is not consumed
                            continue;
                        } catch (JAXBException ex) {
                            exceptions.add(ex);
                        }
                    }
                }
                eventReader.next();
            }
        } catch (XMLStreamException | IOException ex) {
            exceptions.add(ex);
        }
        return autoruns;
    }

    public static Autoruns load(Path path) throws JAXBException {
        return load(path.toFile());
    }

    public static Autoruns load(File file) throws JAXBException {
        JAXBContext context = JAXBContext.newInstance(Autoruns.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        return (Autoruns) unmarshaller.unmarshal(file);
    }

    public static class LoaderException extends Exception {

        public LoaderException(String message) {
            super(message);
        }

        public LoaderException(String message, Throwable cause) {
            super(message, cause);
        }
    }
}
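
The same recover-what-you-can idea can be tried without JAXB on the classpath. The following is a minimal sketch using only the JDK's StAX API, not the Loader above: `PartialItemReader`, `Result` and `readItems` are hypothetical names, the `item` element name is taken from the Autoruns output, and it assumes every child of an item is a simple text element. Items are kept only once their closing tag has been read, and the first parse error stops the loop and is reported alongside the items collected so far.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PartialItemReader {

    // Items that were read completely, plus the error that ended parsing (null if none).
    static final class Result {
        final List<Map<String, String>> items = new ArrayList<>();
        XMLStreamException error;
    }

    static Result readItems(String xml) {
        Result result = new Result();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            Map<String, String> current = null; // non-null while inside an <item>
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item")) {
                        current = new LinkedHashMap<>();
                    } else if (current != null) {
                        // assumption: children of <item> contain text only
                        current.put(name, reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && current != null
                        && reader.getLocalName().equals("item")) {
                    // only a fully closed item is kept
                    result.items.add(current);
                    current = null;
                }
            }
        } catch (XMLStreamException ex) {
            // malformed or truncated input: keep what was read so far
            result.error = ex;
        }
        return result;
    }
}
```

A caller would check `result.error` to find out whether the document was cut short and use `result.items` either way. Unlike the Loader, this sketch cannot skip past a broken item in the middle of the file, because a StAX parser generally cannot resume after a well-formedness error.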
Kevin