How do HTML parsers work?

Question

I've seen the humorous threads and read the warnings, and I know that you don't parse HTML with regex. Don't worry... I'm not planning on trying it.

BUT... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse the (sometimes malformed) markup?

I found the source of one coded in JavaScript, and it actually uses regex to do the job:

// Regular Expressions for parsing tags and attributes
var startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
    endTag = /^<\/(\w+)[^>]*>/,
    attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;

Do they all do this? Is there a conventional, standard way to code an HTML parser?

right after I read your question, that tread you linked had popped up in my head :) — Sungwon Jeong, Feb 18 '11 at 07:49
You can't parse a non-regular language with regexes. You sure can use them to extract information from known regular subsets of the language (which is what they are doing), but that's about it. — NullUserException, Feb 18 '11 at 08:11
@NullUserException: Don’t be ridiculous. I thought everybody knew better than this by now. — tchrist, Feb 18 '11 at 10:57
@NullUserException, @tchrist... did you guys even read the question text? — Peter, Feb 18 '11 at 18:49
@NullUserException: You misunderstand me. I thought everyone by now realized that we are perfectly capable of parsing ɴᴏɴʀᴇɢᴜʟᴀʀ languages with modern patterns. You are the one who has been left out of the loop. — tchrist, Feb 18 '11 at 21:10
@tchrist Regular expressions, by definition, can only parse regular languages. Yes, there are some features in some regex flavors which are irregular and I am [well aware of them](http://stackoverflow.com/questions/3828115/why-arent-modern-regular-expression-dialects-regular/3828131#3828131). However, even with these enhanced regexes, you cannot (properly) parse (X)HTML. — NullUserException, Feb 19 '11 at 01:38
@NullUserException: You’re hung up on old-school formal automata theory: get over it already! Patterns haven’t been ʀᴇɢᴜʟᴀʀ for a really long time now. And don’t tell people what they “can’t” do; you’ll just embarrass yourself when they — or I — show they can. You apparently haven’t read the references I’ve cited. If you had, you would realize that I am perfectly capable and willing to write regexes that are **dynamically self-modifying recursive-descent parsers** in and of themselves. There are more things in heaven and earth than are dreamt of in your automata-theory schoolwork assignments. — tchrist, Feb 19 '11 at 02:09
@tchrist Fair enough. You could also probably find someone who's capable and willing to write a recursive descent parser in x86 assembly, but I personally wouldn't recommend it. To each his own. — NullUserException, Feb 19 '11 at 05:35
@tchrist: Please get practical and provide an example of such a parser working with pcre and is implemented in PHP. Or did you only create hot air? — hakre, Dec 13 '11 at 08:16
If I were going to build an HTML parser from scratch, I would use a parser-generator, such as ANTLR (http://www.antlr.org/). — james.garriss, Nov 17 '14 at 13:03
In fact, someone already has: https://github.com/antlr/grammars-v4/tree/master/html — james.garriss, Nov 17 '14 at 13:37

score 3 · Accepted Answer · edited May 23 '17 at 12:00

I do not know that that style is a “normal” way to do things. It is better than most I’ve seen, but it’s still too close to what I refer to as a “naïve” approach in this answer. For one thing, it isn’t accounting for HTML comments getting in the way of things. There are also legal but somewhat matters of entities it isn’t dealing with. But it’s HTML comments where most such approaches fall down.

A more natural way is to use a lexer to peel off tokens, more like like shown in this answer’s script, then assemble those meaningfully. The lexer would be able to know about the HTML comments easily enough.

You could approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That is the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sort of tags. Those I define fully, but I don’t define valid fields for tags I’m unconcerned with.

Very nice answer. But how does (for example) the DOM library work internally? What does it _do_? — Peter, Feb 18 '11 at 18:53
@Peter: the DOM library is free software, you can just read it's source-code if you really want to find out. — hakre, Dec 13 '11 at 08:55

How do HTML parsers work?

1 Answers1