9

before you start linking to RegEx match open tags except XHTML self-contained tags read whole question.

I'd like to write an HTML parser (only for HTML 5, it should check if it is HTML 5 and if not, return an error) just to learn myself something new, but I don't know what is the best way to do that. Let me show you an example:

<!doctype html>
<html>
<head>
    <!-- #TITLE -->
    <title>Just an example</title>
</head>
<body>
    <p class='main'>Simple paragraph with an <a href='/a.html'>anchor</a></p>
</body>
</html>

Now, could anyone show me how to parse this (final form doesn't matter, just a concept)? I had some ideas (like using recursive functions, or making references to array which holds actual tag), but I don't think these were the best concepts. Should I check char by char and then call specific functions or use regular expressions (explained below)?

By using regular expressions I don't mean one pattern for whole tag. I rather mean using one pattern for tagname (and if this one returns true, check next patterns), then for attribute (and if this one returns true, check again), and lastly check for end of tag.

What should I do when I find tag? Run a loop which checks for tags (and if it finds tag, call it again and again...)? But for me it seems like recursive function or at least half-recursive when function X calls Y which calls X...

So the final question is: what is the most efficient and correct structure for that?

Community
  • 1
  • 1
user1951214
  • 93
  • 1
  • 4
  • 2
    I don't think your answer help me... I saw that question before and I wrote in my question "**By using regular expressions I don't mean one pattern for whole tag.**" And by the way, how did you read this in less than 2 minutes? – user1951214 Aug 01 '13 at 16:21
  • Does this answer your question? [Writing an HTML Parser](https://stackoverflow.com/questions/7192101/writing-an-html-parser) – ggorlen Nov 11 '20 at 22:00

2 Answers2

7

The biggest part of writing an SGML-based parser is the lexer. Here's an article about building a custom lexer: http://onoffswitch.net/building-a-custom-lexer/.

In my opinion, regular expressions would probably be overkill/inappropriate - you want to be matching HTML tokens and character by character parsing is probably the best way to do this.

Kian
  • 1,260
  • 1
  • 13
  • 32
  • 1
    +1 for mentioning a lexer. That's what the OP really needs to be looking at. For further points about the need for a lexer, @JXG's comment in this thread is instructive: http://stackoverflow.com/questions/2400623/how-do-html-parses-work-if-theyre-not-using-regexp?rq=1 – SWalters Aug 01 '13 at 17:29
  • 1
    Hi, I'm the author of that blog post, just wanted to mention that if you know your grammar you can get up and running faster using a parser generator like ANTLR. Building the whole lexer from scratch could be overkill depending on what you want. Also, it'd be worth looking into parser combinators and combinator libraries to get an AST without having to go through a tokenize stage – devshorts Jan 27 '14 at 19:30
  • 1
    link dead :(((( – JBis May 12 '20 at 05:00
7

@Kian's answer mentions using a lexer, but in terms of algorithms I think you'll want to use recursion. HTML is after all a recursive structure:

<div>
    <div>
        <div>
        </div>
    </div>
</div>

Here is a naive JS example - although it's not a complete implementation. (I've included no support for <empty /> elements; for <!-- comments -->; for &entities;; for xmlns:namespaces... writing a full fledged HTML or XML parser is a huge undertaking, so don't take it lightly)

This solution notably skips over the process of lexical analysis, but I've deliberately omitted that to contrast my answer with @Kian's.

var markup = "<!DOCTYPE html>\n"+
             "<html>\n"+
             " <head>\n"+
             "   <title>Example Input Markup</title>\n"+
             " </head>\n"+
             " <body>\n"+
             "   <p id=\"msg\">\n"+
             "     Hello World!\n"+
             "   </p>\n"+
             " </body>\n"+
             "</html>";

parseHtmlDocument(markup);

// Function definitions

function parseHtmlDocument(markup) {
    console.log("BEGIN DOCUMENT");
    markup = parseDoctypeDeclaration(markup);
    markup = parseElement(markup);
    console.log("END DOCUMENT");
}

function parseDoctypeDeclaration(markup) {
    var regEx = /^(\<!DOCTYPE .*\>\s*)/i;
    console.log("DOCTYPE DECLARATION");
    var matches = regEx.exec(markup);
    var doctypeDeclaration = matches[1];
    markup = markup.substring(doctypeDeclaration.length);
    return markup;
}

function parseElement(markup) {
    var regEx = /^\<(\w*)/i;
    var matches = regEx.exec(markup);
    var tagName = matches[1];
    console.log("BEGIN ELEMENT: "+tagName);
    markup = markup.substring(matches[0].length);
    markup = parseAttributeList(markup);
    regEx = /^\>/i;
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    markup = parseNodeList(markup);
    regEx = new RegExp("^\<\/"+tagName+"\>");
    matches = regEx.exec(markup);
    markup = markup.substring(matches[0].length);
    console.log("END ELEMENT: "+tagName);
    return markup;
}

function parseAttributeList(markup) {
    var regEx = /^\s+(\w+)\=\"([^\"]*)\"/i;
    var matches;
    while(matches = regEx.exec(markup)) {
        var attrName = matches[1];
        var attrValue = matches[2];
        console.log("ATTRIBUTE: "+attrName);
        markup = markup.substring(matches[0].length);
    }
    return markup;
}

function parseNodeList(markup) {
    while(markup) {
        markup = parseTextNode(markup);
        var regEx = /^\<(.)/i;
        var matches = regEx.exec(markup);
        if(matches[1] !== '/') {

            markup = parseElement(markup);
        }
        else {
            return markup;
        }
    }
}

function parseTextNode(markup) {
    var regEx = /([^\<]*)\</i;
    var matches = regEx.exec(markup);
    markup = markup.substring(matches[1].length);
    return markup;
}

Ideally each of these functions would map very closely onto the grammar defined in the XML specification. For example, the specification defines an element like so:

element    ::=    EmptyElemTag | STag content ETag

... so ideally we'd want the parseElement() function to look more like this:

function parseElement(markup) {
    if(nextTokenIsEmptyElemTag) { // this kind of logic is where a lexer will help!
        parseEmptyElemTag(markup);
    }
    else {
        parseSTag(markup);
        parseContent(markup);
        parseETag(markup);
    }
}

... but I've cut some corners in writing my example, so it doesn't reflect the actual grammar as closely as it should.

Richard JP Le Guen
  • 28,364
  • 7
  • 89
  • 119