
I've been learning some new tech by using Java to parse files, and for the most part it's going well. However, I'm at a loss as to how I could parse an XML file whose structure is not known upon receipt. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least not that I've found.

So the tl;dr version of this question: how can I parse an XML file when I cannot rely on knowing its structure?

canadiancreed
  • Parsers parse XML without caring about its structure. The only requirement is that it's well formed. Unless you have a validating parser (where the parser also compares the XML with a schema that describes the structure), it will parse your XML. A method like getElementsByTagName is called on an object model of the already parsed XML. Perhaps you want to know how to read the data of a parsed object model. – helderdarocha Feb 23 '14 at 01:51
  • Can you give any examples? What do you want to parse out of that unknown structure? Is it completely unknown, or just some part of it? – therealmarv Feb 23 '14 at 01:58

1 Answer


The parsing part is easy: as helderdarocha stated in the comments, the parser only requires well-formed XML; it does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:

InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
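
As a rough illustration of that reuse, the sketch below parses several documents with one DocumentBuilder and collects each root tag name. The class name ReuseBuilder and the method rootNames are made up for this example, and the in-memory strings stand in for real files. It assumes all parsing happens on a single thread, since a DocumentBuilder should not be shared between threads.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original answer.
public class ReuseBuilder {

    // Parses every XML string with one shared DocumentBuilder
    // and returns the tag name of each document's root element.
    static List<String> rootNames(List<String> xmlDocs) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<String> names = new ArrayList<>();
        for (String xml : xmlDocs) {
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            names.add(doc.getDocumentElement().getTagName());
            builder.reset(); // restore the builder to its initial configuration before the next parse
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootNames(Arrays.asList("<a/>", "<b><c/></b>"))); // prints [a, b]
    }
}
```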

Then you can start with the root document element and use familiar DOM methods from there on out:

Element root = doc.getDocumentElement(); // perform DOM operations starting here.

How you process it really depends on what you want to do with it, but you can use Node methods such as getFirstChild() and getNextSibling() to iterate through the children and process each one as you see fit based on its structure, tag, and attributes.

Consider the following example:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;


public class XML {

    public static void main (String[] args) throws Exception {

        String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";

        // parse
        InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

        // process
        Node objects = doc.getDocumentElement();
        for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
            if (object instanceof Element) {
                Element e = (Element)object;
                if (e.getTagName().equalsIgnoreCase("circle")) {
                    String color = e.getAttribute("color");
                    System.out.println("It's a " + color + " circle!");
                } else if (e.getTagName().equalsIgnoreCase("rectangle")) {
                    String text = e.getTextContent();
                    System.out.println("It's a rectangle that says \"" + text + "\".");
                } else {
                    System.out.println("I don't know what a " + e.getTagName() + " is for.");
                }
            }
        }

    }

}

The input XML document (hard-coded in the example) is:

<objects>
    <circle color='red'/>
    <circle color='green'/>
    <rectangle>hello</rectangle>
    <glumble/>
</objects>

The output is:

It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
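
If nothing at all is known about the structure, the same getFirstChild()/getNextSibling() loop can be applied recursively to visit every element, whatever its name or depth. The following is only a minimal sketch of that idea: the class XmlWalker and its methods walk and describe are names invented for illustration. It records element names, attributes, and non-blank text, and ignores other node types such as comments and CDATA sections.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of the original answer.
public class XmlWalker {

    // Appends one indented line per element (tag name plus attributes)
    // and one per non-blank text node, then recurses into the children.
    static void walk(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node instanceof Element) {
            Element e = (Element) node;
            out.append(indent).append(e.getTagName());
            NamedNodeMap attrs = e.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                out.append(' ').append(a.getNodeName()).append('=').append(a.getNodeValue());
            }
            out.append('\n');
            for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
                walk(child, depth + 1, out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.append(indent).append("text: ").append(text).append('\n');
            }
        }
    }

    // Parses the XML string and returns the indented outline built by walk().
    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe(
                "<objects><circle color='red'/><rectangle>hello</rectangle><glumble/></objects>"));
    }
}
```

For that input, the sketch prints:

```
objects
  circle color=red
  rectangle
    text: hello
  glumble
```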
Jason C