How to create/write a simple XML parser from scratch?

Question

Rather than code samples, I want to know what are the simplified, basic steps in English.

How is a good parser designed? I understand that regex should not be used in a parser, but how much of a role does regex play in parsing XML?

What is the recommended data structure to use? Should I use linked lists to store and retrieve nodes, attributes, and values?

I want to learn how to create an XML parser so that I can write one in D programming language.

Unfortunately, Googling "document building parser" only leads back to this question. If you create an answer, maybe you could address the difference between an event-driven parser and document-building parser. — XP1, Jun 04 '11 at 22:44
I'll note that there is no such language as "simple XML". If you are planning to parse XML, then your parser should parse all of XML, not just some of it. The reason is simple: today you may only need "simple" XML, but tomorrow, your code will likely be asked to parse "real XML". — John Saunders, Jun 05 '11 at 02:07
@JohnSaunders I think he meant a simple parser, not simple XML. — Sundeep, Dec 02 '13 at 05:42

score 15 · Answer 1 · answered Jun 05 '11 at 09:25

If you don't know how to write a parser, then you need to do some reading. Get hold of any book on compiler-writing (many of the best ones were written 30 or 40 years ago, e.g. Aho and Ullman) and study the chapters on lexical analysis and syntax analysis. XML is essentially no different, except that the lexical and grammar phases are not as clearly isolated from each other as in some languages.

One word of warning: if you want to write a fully conformant XML parser, then 90% of your effort will be spent getting edge cases right in obscure corners of the spec, dealing with things such as parameter entities that most XML users aren't even aware of.
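To make the lexical-analysis phase a little more concrete, here is a minimal sketch in D that only splits the input into text and markup tokens. The names `TokenKind`, `Token`, `findFrom` and `tokenize` are made up for this illustration, and it assumes every `<` is closed by the next `>`, which is not true of real XML (comments, CDATA sections and quoted attribute values can all contain `>`).

```d
enum TokenKind { text, markup }

struct Token {
    TokenKind kind;
    string value; // text content, or the markup without its angle brackets
}

// Index of the first `ch` at or after `from`, or input.length if there is none.
size_t findFrom(string input, size_t from, char ch) {
    size_t i = from;
    while (i < input.length && input[i] != ch) ++i;
    return i;
}

// Lexical phase only: cut the input into alternating text and markup tokens.
// Assumption: every '<' is closed by the next '>'.
Token[] tokenize(string input) {
    Token[] tokens;
    size_t pos = 0;
    while (pos < input.length) {
        immutable open = findFrom(input, pos, '<');
        if (open > pos)                       // text before the next tag
            tokens ~= Token(TokenKind.text, input[pos .. open]);
        if (open == input.length) break;      // no more markup
        immutable close = findFrom(input, open, '>');
        if (close == input.length)
            throw new Exception("unterminated markup");
        tokens ~= Token(TokenKind.markup, input[open + 1 .. close]);
        pos = close + 1;
    }
    return tokens;
}
```

The syntax-analysis phase would then consume this token list and check that the markup tokens form properly nested elements.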

Curious, what data structure would be best for this task? My gut instinct says a general tree, and not knowing if OP also wants to build this from scratch, he/she might be in for a lengthy project. — j9000, Feb 07 '20 at 15:38

score 9 · Answer 2 · edited Jul 10 '15 at 16:57

for an event based parser the user needs to pass it some functions (startNode(name,attrs), endNode(name) and someText(txt), likely through an interface) and the parser calls them when needed as it passes over the file

the parser will have a while loop that alternates between reading until < and reading until >, and does the proper conversions to the parameter types

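import std.array : split;
import std.stdio : File;
import std.string : chomp;

//callbacks supplied by the user; the parser calls them as it passes over the file
interface EventParser{
    void startNode(string name, string[string] attrs);
    void endNode(string name);
    void someText(string txt);
}

void parse(EventParser p, File file){
    string str;
    while((str = file.readln('<')).length != 0){
        //not using a rewritable buffer to take advantage of slicing
        //but it's a quick conversion to an implementation with a rewritable buffer though
        str = str.chomp("<");
        if(str.length > 0) p.someText(str);

        str = file.readln('>').chomp(">");
        if(str.length == 0) break; //input ended after trailing text

        //strip the '/' markers first so they don't end up in the name
        bool closing = str[0] == '/';
        bool selfClosing = !closing && str[$-1] == '/';
        if(closing) str = str[1..$];
        if(selfClosing) str = str[0..$-1];

        //split str in name and attrs (naive: whitespace inside a quoted value breaks this)
        auto parts = str.split();
        if(parts.length == 0) continue; //empty tag such as <>
        string name = parts[0];
        string[string] attrs;
        foreach(attribute; parts[1..$]){
            auto splitAttr = attribute.split("=");
            if(splitAttr.length == 2) attrs[splitAttr[0]] = splitAttr[1]; //value keeps its quotes
        }

        if(closing) p.endNode(name);
        else {
            p.startNode(name, attrs);
            if(selfClosing) p.endNode(name); //self closing tag
        }
    }
}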

you can build a DOM parser on top of an event based parser; the basic functionality you'll need for each node is getChildren, getParent, getName and getAttributes (with setters when building ;) )

the event handler for the DOM parser, using nodes with the methods described above:

class DOMEventParser : EventParser{
    DOMNode root;    //kept so the finished tree can be retrieved after parsing
    DOMNode current;
    this(){
        //created here rather than in a field initializer, which D evaluates at compile time
        root = new RootNode();
        current = root;
    }
    override void startNode(string name,string[string] attrs){
        DOMNode tmp = new ElementNode(current,name,attrs);
        current.appendChild(tmp);
        current = tmp;
    }
    override void endNode(string name){
        assert(name == current.name);
        current = current.parent;
    }
    override void someText(string txt){
        current.appendChild(new TextNode(txt));
    }
}

when the parsing ends the rootnode will have the root of the DOM tree

note: I didn't put any verification code in there to ensure correctness of the xml

edit: the parsing of the attributes has a bug in it; instead of splitting on whitespace, a regex is better for that
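as a rough sketch of what that regex-based attribute parsing could look like (this is not the original author's code; `parseAttributes` is a name made up here, and it assumes values are always quoted and leaves entity references such as &amp;amp; undecoded):

```d
import std.regex : matchAll, regex;

// Extracts name="value" and name='value' pairs from the text of a start tag.
// Assumptions: values are always quoted, and entity references inside
// values are returned as-is rather than decoded.
string[string] parseAttributes(string tagBody) {
    // group 1: name, group 2: double-quoted value, group 3: single-quoted value
    auto attrRe = regex(`([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')`);
    string[string] attrs;
    foreach (m; matchAll(tagBody, attrRe)) {
        // only one of the two value groups takes part in any given match
        attrs[m[1]] = m[2].length ? m[2] : m[3];
    }
    return attrs;
}
```

unlike splitting on whitespace, this keeps a value such as 'two words' in one piece and drops the surrounding quotes.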

score 7 · Answer 3 · edited May 06 '14 at 13:47

There is a difference between a parser and a nodelist. The parser is the piece that takes a bunch of plain text XML and tries to determine what nodes are in there. Then there is an internal structure you save the nodes in. In a layer over that structure you find the DOM, the Document Object Model. This is a structure of nested nodes that make up your XML document. The parser only needs to know the generic DOM interface to create nodes.

I wouldn't use regex as a parser for this. I think the best thing is just to traverse the string char by char and check that what you get matches what you should get.
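A minimal sketch of that char-by-char approach in D might look like the following. All names here (`Cursor`, `expect`, `readWhile`, `parseSimpleElement`) are hypothetical, and the example deliberately handles only a single element of the form <name>text</name> with no attributes or nesting.

```d
import std.ascii : isAlphaNum;
import std.conv : to;

// Walks the input one character at a time.
struct Cursor {
    string input;
    size_t pos;

    bool atEnd() const { return pos >= input.length; }

    // Consume `expected` or fail: the "does what I got match what I should get" check.
    void expect(char expected) {
        if (atEnd || input[pos] != expected)
            throw new Exception("expected '" ~ expected ~ "' at offset " ~ to!string(pos));
        ++pos;
    }

    // Consume characters while `keep` accepts them and return that slice.
    string readWhile(bool function(char) keep) {
        immutable start = pos;
        while (!atEnd && keep(input[pos])) ++pos;
        return input[start .. pos];
    }
}

// Simplified name rule (ASCII only); the XML spec allows many more characters.
bool isNameChar(char c) { return isAlphaNum(c) || c == '_' || c == '-' || c == ':' || c == '.'; }
bool isTextChar(char c) { return c != '<'; }

struct SimpleElement { string name; string text; }

// Parses exactly one <name>text</name> element.
SimpleElement parseSimpleElement(string input) {
    auto cur = Cursor(input);
    cur.expect('<');
    auto name = cur.readWhile(&isNameChar);
    if (name.length == 0) throw new Exception("missing element name");
    cur.expect('>');
    auto text = cur.readWhile(&isTextChar);
    cur.expect('<');
    cur.expect('/');
    foreach (c; name) cur.expect(c); // the end tag must repeat the start tag's name
    cur.expect('>');
    return SimpleElement(name, text);
}
```

A full parser would grow this into one function per grammar rule (element, attribute, comment and so on), each built from the same `expect` and `readWhile` primitives.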

But why not use one of the existing XML parsers? There are many possibilities in encoding data, and many exceptions. If your parser doesn't manage them all, it is hardly worth the title of XML parser.

score 2 · Answer 4 · edited Jul 10 '15 at 17:05

A parser must fit the needs of your input language; in your case, simple XML. The first thing to know about XML is that it is context-free and unambiguous: everything is wrapped between two tokens, and this is what makes XML famous for being easy to parse. Finally, XML can always be represented by a tree structure. As stated, you can either parse your XML and execute code as you go, or parse the XML into a tree and then execute code according to that tree.

D provides a very interesting way to write an XML parser very easily, for example:

doc.onStartTag["pointlight"] = (ElementParser xml)
{
  debug writefln("Parsing pointlight element");

  auto l = new DistantLight(to!int(xml.tag.attr["x"]),
                            to!int(xml.tag.attr["y"]),
                            to!int(xml.tag.attr["z"]),
                            to!ubyte(xml.tag.attr["red"]),
                            to!ubyte(xml.tag.attr["green"]),
                            to!ubyte(xml.tag.attr["blue"]));
  lights ~= l;

  xml.parse();
};

I've never heard of a language named "simple XML". Can you provide a link? Is it an International standard? — John Saunders, Jun 05 '11 at 02:06
By simple XML I mean . Then you have things like HTML, which is basically XML but does not respect this "standard", e.g. <br> is allowed and must be handled by the parser. Another question? — Julio Guerra, Jun 05 '11 at 13:59

score 1 · Answer 5 · edited Sep 14 '17 at 05:48

The first element in the document should be the prolog. This states the xml version, the encoding, whether the file is standalone, and maybe some other stuff. The prolog opens with <?.

After the prolog, there are tags with metadata. The special tags, like comments, doctypes, and element definitions, should start with <!. Processing instructions start with <?. It is possible to have nested tags here, as the <!DOCTYPE tag can have <!ELEMENT and <!ATTLIST tags in a DTD-style XML document; see Wikipedia for a thorough example.

There should be exactly one top level element. It's the only one without a <! or a <? preceding it. There may be more metadata tags after the top level element; process those first.

For the explicit parsing: first identify tags (they all start with <), then determine what kind of tag it is and what its closure looks like. <!-- is a comment tag, and cannot have -- anywhere except at its end. <? ends with ?>. <! ends with >. To repeat: <!DOCTYPE can have tags nested before its closure, and there may be other nested tags I don't know of.
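The classification step above can be sketched in D roughly as follows. `MarkupKind`, `classify` and `terminator` are names invented for this example, and a <!DOCTYPE with nested declarations is deliberately not handled.

```d
import std.algorithm.searching : startsWith;

enum MarkupKind { comment, processingInstruction, declaration, endTag, startTag }

// Decides what kind of markup begins at the start of `s`.
// Order matters: "<!--" has to be tested before the more general "<!".
MarkupKind classify(string s) {
    if (s.startsWith("<!--")) return MarkupKind.comment;
    if (s.startsWith("<?"))   return MarkupKind.processingInstruction;
    if (s.startsWith("<!"))   return MarkupKind.declaration;
    if (s.startsWith("</"))   return MarkupKind.endTag;
    if (s.startsWith("<"))    return MarkupKind.startTag;
    throw new Exception("not markup: " ~ s);
}

// The closing sequence to scan for, per kind.
// Assumption: a <!DOCTYPE containing nested <!ELEMENT or <!ATTLIST
// declarations needs extra handling that is not shown here.
string terminator(MarkupKind kind) {
    final switch (kind) {
        case MarkupKind.comment:               return "-->";
        case MarkupKind.processingInstruction: return "?>";
        case MarkupKind.declaration:
        case MarkupKind.endTag:
        case MarkupKind.startTag:              return ">";
    }
}
```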

Once you find a tag, you'll want to find its closing tag. Check if the tag is self closing first; otherwise, find its closure.
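One common way to pair each tag with its closing tag is a stack of open element names. The sketch below is only an illustration: `tagsBalanced` is a hypothetical helper, and it assumes the tags have already been extracted as bare names without angle brackets or attributes (for example "a", "br/", "/a").

```d
// Checks that start and end tags nest properly.
// Assumption: each entry is a tag's content without angle brackets or
// attributes, e.g. "a" (start), "/a" (end) or "br/" (self-closing).
bool tagsBalanced(string[] tags) {
    string[] open; // stack of currently open element names
    foreach (tag; tags) {
        if (tag.length == 0) return false;
        if (tag[0] == '/') {
            // end tag: must match the innermost open element
            if (open.length == 0 || open[$ - 1] != tag[1 .. $]) return false;
            open = open[0 .. $ - 1];
        } else if (tag[$ - 1] == '/') {
            continue; // self-closing: opens and closes at once
        } else {
            open ~= tag; // start tag: remember it until its end tag shows up
        }
    }
    return open.length == 0; // anything still open was never closed
}
```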

For data structures: I would recommend a tree structure, where each element is a node, and each node has an indexed/mapped list of subelements.
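A tree node along those lines could be sketched in D as below. The class and field names are assumptions for the example, not part of any existing library; children are kept both in document order and in a map keyed by element name.

```d
// One element of the document tree.
class Node {
    string name;
    string[string] attributes;
    string text;
    Node parent;
    Node[] children;               // children in document order
    Node[][string] childrenByName; // the same children, indexed by element name

    this(string name) { this.name = name; }

    // Adds `child` under this node and returns it for convenience.
    Node appendChild(Node child) {
        child.parent = this;
        children ~= child;
        childrenByName[child.name] ~= child;
        return child;
    }
}
```

A dynamic array plus an associative array covers both ordered traversal and lookup by name, so a hand-rolled linked list is not needed for this.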

Obviously, a full parser will require a lot more research; I hope this is enough to get you started.

score 0 · Answer 6 · answered Aug 01 '11 at 13:57

Since D is rather closely related to Java, one option might be to generate an XML parser with ANTLR (there are most probably XML EBNF grammars for ANTLR already, which you could use) and then convert the generated Java parser code to D. At least that would give you a starting point, and you could then put some effort into optimizing the code specifically for D ...

At least ANTLR is not at all as hard as many seem to think. I got started after knowing nothing about it, by watching 3-4 of this great set of screencasts on ANTLR.

Btw, I found ANTLRWorks a breeze to work with (as opposed to the Eclipse plugin used in the screencast ... but the screencast content applies anyway).

Just my 0.02c.
