If the document has a prolog (the XML declaration), it must be the very first thing in the file. It states the XML version and, optionally, the encoding and whether the document is standalone; nothing else is allowed in it. The prolog opens with <?xml and closes with ?>.
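As a rough illustration of reading the prolog, here is a minimal Python sketch. The names `parse_declaration`, `DECL_RE`, and `ATTR_RE` are made up for this example, and the regexes are deliberately loose: they do not validate attribute order or the allowed values.

```python
import re

# Hypothetical helpers, not part of any library.
# Matches an XML declaration only at the very start of the text.
DECL_RE = re.compile(r'^<\?xml\s+(.*?)\?>', re.DOTALL)
# Matches name="value" or name='value' pairs inside the declaration.
ATTR_RE = re.compile(r'(\w+)\s*=\s*(["\'])(.*?)\2')

def parse_declaration(text):
    """Return (attributes, rest_of_document).

    attributes is an empty dict when the text has no declaration.
    """
    match = DECL_RE.match(text)
    if match is None:
        return {}, text
    attrs = {name: value
             for name, _quote, value in ATTR_RE.findall(match.group(1))}
    return attrs, text[match.end():]
```

Calling `parse_declaration` on the whole file gives you the declared version and encoding, plus the remaining text to hand to the rest of the parser.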
After the prolog comes markup carrying metadata. The special tags, such as comments, doctypes, and element declarations, start with <!. Processing instructions start with <?. Tags can be nested here: in a DTD-style XML document, the <!DOCTYPE tag can contain <!ELEMENT and <!ATTLIST tags inside a [ ... ] block. See Wikipedia for a thorough example.
There must be exactly one top-level element. It's the only top-level tag that doesn't start with <! or <?. Comments and processing instructions may also appear after the top-level element; process those first.
For the explicit parsing: first identify tags (they all start with <), then determine what kind of tag each one is and what its closure looks like. <!-- opens a comment, which ends with --> and cannot contain -- anywhere before that. <? ends with ?>. Other <! tags end with >, except <![CDATA[ sections, which end with ]]>. To repeat: <!DOCTYPE can have tags nested before its closure, so its closing > is the one after the nested block, and there may be other nested tags I don't know of.
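The classification step could be sketched in Python roughly as follows. The names `read_markup`, `MARKUP`, and `_doctype_end` are hypothetical, and the sketch makes simplifying assumptions: it ignores > or [ characters inside quoted strings, and it treats self-closing tags as ordinary start tags.

```python
# (opener, kind, closer); longer openers must come before their prefixes.
MARKUP = [
    ('<!--', 'comment', '-->'),
    ('<![CDATA[', 'cdata', ']]>'),
    ('<?', 'processing-instruction', '?>'),
    ('<!', 'declaration', '>'),
    ('</', 'end-tag', '>'),
    ('<', 'start-tag', '>'),
]

def _doctype_end(text, pos):
    """Find the end of a <!DOCTYPE, skipping a nested [ ... ] block if present."""
    first_gt = text.find('>', pos)
    if first_gt == -1:
        raise ValueError(f'unterminated doctype at offset {pos}')
    bracket = text.find('[', pos)
    if bracket == -1 or bracket > first_gt:
        return first_gt + 1          # no nested declarations
    close = text.find(']', bracket)
    end = text.find('>', close) if close != -1 else -1
    if end == -1:
        raise ValueError(f'unterminated doctype at offset {pos}')
    return end + 1

def read_markup(text, pos):
    """Classify the markup at text[pos] and return (kind, index past its end)."""
    if text.startswith('<!DOCTYPE', pos):
        return 'doctype', _doctype_end(text, pos)
    for opener, kind, closer in MARKUP:
        if text.startswith(opener, pos):
            # Search after the opener so '<!-->' is not read as a full comment.
            end = text.find(closer, pos + len(opener))
            if end == -1:
                raise ValueError(f'unterminated {kind} at offset {pos}')
            return kind, end + len(closer)
    raise ValueError(f'no markup at offset {pos}')
```

A driver loop would call `read_markup` at each < it finds and continue scanning from the returned index.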
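Once you find an element's opening tag, you'll want to find its closing tag. Check first whether the tag is self-closing (it ends with />); otherwise, find the matching </name> tag, keeping in mind that elements with the same name can be nested.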
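One way to find the closure is to count nesting depth for tags with the same name. The Python sketch below uses a hypothetical `find_element_end` function and a simplified `TAG_RE` pattern. It assumes comments and CDATA sections containing tag-like text have already been skipped, and that attribute values contain no < or > characters.

```python
import re

# Simplified tag pattern: optional '/', a name, attributes, optional '/'.
TAG_RE = re.compile(r'<(/?)([A-Za-z_][\w.\-:]*)([^<>]*?)(/?)>')

def find_element_end(text, pos):
    """Return the index just past the element whose start tag is at text[pos]."""
    first = TAG_RE.match(text, pos)
    if first is None or first.group(1):
        raise ValueError(f'expected a start tag at offset {pos}')
    name = first.group(2)
    if first.group(4):               # self-closing: <name ... />
        return first.end()
    depth = 1
    for tag in TAG_RE.finditer(text, first.end()):
        if tag.group(2) != name:
            continue                 # other elements do not affect the count
        if tag.group(1):             # </name>
            depth -= 1
        elif not tag.group(4):       # a nested <name> that is not self-closing
            depth += 1
        if depth == 0:
            return tag.end()
    raise ValueError(f'no closing tag for <{name}>')
```

Everything between the end of the start tag and the start of the matching end tag is the element's content, which can be parsed recursively.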
For data structures: I would recommend a tree structure, where each element is a node, and each node has an indexed/mapped list of subelements.
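A minimal Python sketch of such a node might look like this. The `Node` class and its field names are illustrative assumptions; it keeps children both in document order and in a map keyed by element name.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One element in the tree (hypothetical layout for illustration)."""
    name: str
    attributes: dict = field(default_factory=dict)
    text: str = ''
    children: list = field(default_factory=list)   # subelements in document order
    by_name: dict = field(default_factory=dict)    # element name -> list of subelements

    def add(self, child):
        """Attach a subelement and index it by name; returns the child."""
        self.children.append(child)
        self.by_name.setdefault(child.name, []).append(child)
        return child
```

For example, after `root = Node('library')` and a few `root.add(Node('book'))` calls, `root.children` preserves the original order while `root.by_name['book']` gives direct access to every book element.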
Obviously, a full parser will require a lot more research; I hope this is enough to get you started.