Parsing simple HTML into tree

Question

I'd like to ask what is the best way to parse a simple html code into DOM Node tree like this one:

example tree

Here are some constraints I am facing:

HTML code will have only pair tags, no attributes and I've to ignore spaces
There can be Text between tags like <p>, <h1>, <a> etc.
I can't use libraries

I was thinking about regex, but never tried it so.. Any ideas?

Every node in the tree is this struct:

  typedef struct tag
  {
      struct tag* parent;
      struct tag* nextSibling;
      struct tag* previousSibling;
      struct tag* firstChild;
      struct tag* lastChild;     
      char* name;
      char* text;     
  }node;

1) don't use regex unless the markup is very predictable and 2) a library would be far, far easier than writing your own. That said, you could implement your own state machine: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html. The formal definition is quite large, but you could pick and choose what features/behaviors you wish to implement. — Tim M., Mar 26 '13 at 21:50
here is a code you may find helpful: http://www.w3.org/Library/src/HTML.c — Grijesh Chauhan, Mar 26 '13 at 21:51
@TimMedora oh, ok, I thought regex would be good, I know library would be better, but that's school homework and we can't use libraries :/ — user2213470, Mar 26 '13 at 21:56
[The obligatory link on regexes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). Despite the valid point made by the accepted answer, there *are* ways to parse HTML with regexes...it can just get out of control very, very quickly. — Tim M., Mar 26 '13 at 21:59
Since you mention this is homework, I'd try to write a very simple state machine instead of using regular expressions. That's the right way to do it, and a valuable learning experience if you haven't done it before. — Tim M., Mar 26 '13 at 22:08

score 3 · Answer 1 · edited Jun 30 '18 at 23:46

I know it isin't in C, but that presentation might give you some input on how you could efficiently tackle the problem.

https://web.archive.org/web/20120115060003/http://cuddle.googlecode.com/hg/talk/lex.html#landing-slide

I also wrote a very simple parser example in JavaScript (again not in C, but hopefully you know JS as well) based on your initial requirements, meaning that it will not parse any attributes and do not handle self-closing tags and many other things that should be handled according to the HTML specs. It will produce a parse tree in this format:

{
    cn: [{
        tag: 'html',
        cn: [{
            tag: 'body',
            cn: [
                { tag: 'h1', cn: ['test'] },
                ' some text ',
                ...
            ]
        }] 
    }]
}

Here's the code and fiddle: http://jsfiddle.net/LUpyZ/3/

Note that white space is not ignored and will be captured in text nodes.

var html = '<html><body><h1>test</h1> some text <div> <p>text</p></div></body></html>';

var parseHTML = (function () {
    var nodesStack = [],
        i = 0,
        len = html.length,
        stateFn = parseText,
        parseTree = { cn: [] },
        alphaNumRx = /\w/,
        currentNode = parseTree,
        text = '',
        tag = '',
        newNode;

    function parseTag(token) {
        if (token === '/') {
            return parseCloseTag;
        }

        i--; //backtrack to first tag character
        return parseOpenTag;
    }

    function parseCloseTag(token) {
        if (token === '>') {
            if (currentNode.tag !== tag) {
                throw 'Wrong closed tag at char ' + i;
            }

            tag = '';

            nodesStack.pop();

            currentNode = currentNode.parentNode;

            return parseText;            
        }

        assertValidTagNameChar(token);

        tag += token;

        return parseCloseTag;
    }

    function parseOpenTag(token) {
        if (token === '>') {
            currentNode.cn.push(newNode = { tag: tag, parentNode: currentNode,  cn: []});
            nodesStack.push(currentNode = newNode);

            tag = '';

            return parseText;
        }

        assertValidTagNameChar(token);

        tag += token;

        return parseOpenTag;
    }

    function parseText(token) {
        if (token === '<') {

            if (text) {
                currentNode.cn.push(text);
                text = '';
            }

            return parseTag;
        }

        text += token;

        return parseText;
    }

    function assertValidTagNameChar(c) {
        if (!alphaNumRx.test(c)) {
            throw 'Invalid tag name char at ' + i;
        }
    }

    return function (html) {
        for (; i < len; i++) {
            stateFn = stateFn(html[i]);
        }

        if (currentNode = nodesStack.pop()) {
            throw 'Unbalanced tags: ' + currentNode.tag + ' is never closed.';
        }

        return parseTree;
    };
})();

console.log(parseHTML(html));

`if (token === '/') { return parseCloseTag; }` This is not necessary true, as `/` can be part of a URL in the href attribute of the tag for instance — ThunderWiring, May 19 '23 at 07:36
@ThunderWiring wow old answer hehe. I mention in the answer it doesn't cover attributes so that's not a problem. A / in a text node would just be interpreted as text and a / in a tag body can only be a closing tag (assuming there's no attributes). Furthermore that's a very simplistic example, not a production ready parser. — plalx, May 24 '23 at 20:49

Parsing simple HTML into tree

1 Answers1