I have an XML file whose overall schema I do not know. Within it, I am trying to parse table elements whose schema I do know; they follow the standard HTML table format.

I am ignoring all DTD references using this answer, and I extract my table nodes like this:

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    dbf.setNamespaceAware(true);
    dbf.setFeature("http://xml.org/sax/features/namespaces", false);
    dbf.setFeature("http://xml.org/sax/features/validation", false);
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(input);
    NodeList tables = doc.getElementsByTagName("table");

Given that I have a domain class for a table, is there an easy way to map those nodes to Java objects?

T A

1 Answer

First, if you happen to be parsing HTML, you really need to use an HTML parser, because an XML parser will fail on HTML that is not well-formed XML. That said, with a little coding you can do this using SimpleXml.

These are the steps:

  1. Create an instance of the parser (no configuration is necessary; support for DTDs and namespaces is deliberately left out of the parser)
  2. Parse the XML into a DOM tree
  3. Use getElementsByTagName("table") to get all the table elements
  4. Loop through the list and convert the DOM elements to POJOs

It turns out that SimpleXml doesn't have a getElementsByTagName() out of the box, so I wrote one. Here is the full code:

    import xmlparser.XmlParser;
    import xmlparser.annotations.XmlName;
    import xmlparser.annotations.XmlTextNode;
    import xmlparser.model.XmlElement;

    import java.util.ArrayList;
    import java.util.List;

    public final class Question {

        // Sample document: a DTD reference plus a table nested in unrelated elements
        private static final String xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"+
            "<!DOCTYPE note SYSTEM \"Note.dtd\">\n" +
            "<note>\n" +
                "<to>Tove</to>\n" +
                "<from>Jani</from>\n" +
                "<heading>Reminder</heading>\n" +
                "<body>Don't forget me this weekend!</body>\n" +
                "<sometag>\n" +
                    "<table>\n" +
                        "<tr><td>content</td></tr>\n" +
                    "</table>\n" +
                "</sometag>\n" +
            "</note>\n";

        // POJOs mirroring the table structure; field names match the child tag names
        @XmlName("table")
        private static class Table {
            private List<Question.Row> tr;
        }
        private static class Row {
            private List<Cell> td;
        }
        private static class Cell {
            @XmlTextNode
            private String text;
        }

        public static void main(final String... args) {
            final XmlParser simple = new XmlParser();
            // Parse the whole document into a generic element tree
            final XmlElement xmlElement = simple.fromXml(xml);
            // Collect every <table> element, wherever it sits in the tree
            final List<XmlElement> list = getElementsByTagName(xmlElement, "table");
            for (final XmlElement element : list) {
                // Map each table element onto the Table POJO
                Table table = simple.fromXml(element, Table.class);
                System.out.println(table.tr.get(0).td.get(0).text);
            }
        }

        private static List<XmlElement> getElementsByTagName(final XmlElement element, final String name) {
            final List<XmlElement> list = new ArrayList<>();
            getElementsByTagName(element, name, list);
            return list;
        }
        // Depth-first walk that adds every element with a matching name to the list
        private static void getElementsByTagName(final XmlElement element, final String name, final List<XmlElement> list) {
            if (element == null) return;
            if (name.equals(element.name)) list.add(element);
            if (element.children == null) return;
            for (final XmlElement child : element.children) {
                getElementsByTagName(child, name, list);
            }
        }

    }

The output of the code will be a single line with the word 'content'.

SimpleXml is available on Maven Central: https://mvnrepository.com/artifact/com.github.codemonstur/simplexml/2.8.1
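
If adding a library is not an option, the same mapping can probably be done by hand with the JDK's DOM API alone, staying close to the code in the question. The following is only a sketch under some assumptions: the class name DomTableMapper and the Table and Row classes are hypothetical stand-ins for your own domain class, and the load-external-dtd feature is the Xerces-specific one from the question, which the JDK's bundled parser should accept but other parsers may reject.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public final class DomTableMapper {

    // Hypothetical domain classes; a real Table class will likely differ.
    public static final class Table {
        public final List<Row> rows = new ArrayList<>();
    }
    public static final class Row {
        public final List<String> cells = new ArrayList<>();
    }

    public static List<Table> parseTables(final String xml) throws Exception {
        final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        // Xerces-specific feature (taken from the question): do not fetch the external DTD.
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        final DocumentBuilder db = dbf.newDocumentBuilder();
        final Document doc = db.parse(new InputSource(new StringReader(xml)));

        // Find every <table> in the document and map each one by hand.
        final List<Table> tables = new ArrayList<>();
        final NodeList tableNodes = doc.getElementsByTagName("table");
        for (int i = 0; i < tableNodes.getLength(); i++) {
            tables.add(toTable((Element) tableNodes.item(i)));
        }
        return tables;
    }

    private static Table toTable(final Element tableElement) {
        final Table table = new Table();
        for (final Element tr : childElements(tableElement, "tr")) {
            final Row row = new Row();
            for (final Element td : childElements(tr, "td")) {
                row.cells.add(td.getTextContent().trim());
            }
            table.rows.add(row);
        }
        return table;
    }

    // Direct children only, so rows of a nested table do not leak into the outer one.
    private static List<Element> childElements(final Element parent, final String name) {
        final List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(final String... args) throws Exception {
        final String xml = "<!DOCTYPE note SYSTEM \"Note.dtd\">\n"
            + "<note><sometag><table><tr><td>content</td></tr></table></sometag></note>";
        for (final Table table : parseTables(xml)) {
            System.out.println(table.rows.get(0).cells.get(0));
        }
    }
}
```

This sketch only looks at tr elements directly under table and td elements directly under tr, so tables that wrap their rows in thead or tbody, or that use th cells, would need extra handling.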

jurgen
  • Thanks for your answer! I will check if the implementation works for me this week and accept your answer if it does. – T A Nov 10 '20 at 11:20