Creating my own html parser

Question

I know this post and have already read it, but I'd still like to learn what language an HTML parser is (or can be) written in. Does it parse the whole source with a regex, or does it use a general-purpose programming language such as C# or Python?

Apart from the question above, could you also briefly tell me where I should start in order to create my own parser? (I'd like to create an HTML parser for my personal needs :)

You can use any [Turing complete](http://en.wikipedia.org/wiki/Turing_completeness) language. Regular expressions (at least those of [formal language theory](http://en.wikipedia.org/wiki/Regular_expression#Formal_language_theory)) aren’t. But most regular expression libraries and implementations are far more capable (see for example [Can extended regex implementations parse HTML?](http://stackoverflow.com/questions/4933611/can-extended-regex-implementations-parse-html)). — Gumbo, Jul 29 '11 at 18:12
Be sure to read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 A masterpiece of StackOverflow. — Iterator, Jul 29 '11 at 18:12
@Bart Kiers I am not facing a particular problem, I just like to learn new stuff. — Shaokan, Jul 29 '11 at 18:23
Ah, okay. I didn't interpret "for my personal needs" as a learning experience. — Bart Kiers, Jul 29 '11 at 18:24
@Gumbo: technically, you can use a pushdown automaton. You don't need Turing completeness :D. And to Shaokan: the fact that HTML has a context-free grammar makes any traditional programming language quite suitable. There are a variety of tools for building such parsers. I like ANTLR with Java (or C# or Python). If you want to build such a parser completely by hand, you should consult any reference on compiler implementation. Parsing CFGs is almost always well discussed in compiler books. — ccoakley, Jul 29 '11 at 20:39
@ccoakley: But PDAs are equivalent to context-free grammars and not to regular expressions. — Gumbo, Jul 30 '11 at 06:29
@Gumbo: Yes. I wasn't picking that nit, only tightening the upper bound on which languages are appropriate. Parser generator languages based on BNF are not Turing complete, but they can be quite appropriate for the task. — ccoakley, Aug 01 '11 at 15:01
Does this answer your question? [Writing an HTML Parser](https://stackoverflow.com/questions/7192101/writing-an-html-parser) — ggorlen, Nov 11 '20 at 22:01

score 2 · Accepted Answer · answered Jul 29 '11 at 18:15

Python, Java, and Perl are all fine languages for learning to write an HTML parser. Perl is very pleasant for regular expressions, but that's not what you need for a parser. It is a bit more pleasant to write OO programs in Python or Java. C/C++/C#, etc., are also common, for very fast parsers. However, as a learning exercise, I recommend Python or Java, so that you can compare your work with standard parsers.
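To make the "compare your work with standard parsers" idea concrete, here is a minimal sketch in Python. It uses the standard library's `html.parser.HTMLParser` to record a stream of parse events that a hand-written parser could be checked against. The names `EventRecorder` and `reference_events` are made up for this illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records the events emitted by the standard-library HTML parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.events.append(("text", data))


def reference_events(html):
    """Return the event list the standard parser produces for `html`."""
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for event in reference_events('<p class="x">Hi <b>there</b></p>'):
        print(event)
```

If your own parser can emit events in the same shape, comparing the two lists on a handful of sample documents is a cheap way to find the cases you have not handled yet.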

score 1 · Answer 2 · answered Aug 01 '11 at 17:22

The standard way is to use a Lex/Yacc pair: Lex generates the code that splits the input into tokens (the lexer), and Yacc generates the code that converts that token stream into the desired structure (the parser).
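The same two-stage split can be written by hand. The sketch below is not Lex/Yacc output and covers only a tiny subset of HTML, assuming well-nested tags and no void elements such as `<br>`. Attributes are matched but discarded, and comments are not handled. The names `tokenize` and `parse` are hypothetical.

```python
import re

# Lexer stage: a tag (opening or closing) or a run of text.
# Input that matches neither alternative (e.g. a stray "<") is skipped.
TOKEN_RE = re.compile(
    r"<(?P<close>/)?(?P<tag>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|(?P<text>[^<]+)"
)


def tokenize(source):
    """Yield ("open" | "close" | "text", value) tokens from the source."""
    for match in TOKEN_RE.finditer(source):
        if match.group("text") is not None:
            yield ("text", match.group("text"))
        elif match.group("close"):
            yield ("close", match.group("tag").lower())
        else:
            yield ("open", match.group("tag").lower())


def parse(tokens):
    """Parser stage: build a nested tree from tokens with an explicit stack."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "open":
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:  # "close": must match the innermost open element
            if len(stack) == 1 or stack[-1]["tag"] != value:
                raise ValueError(f"unexpected </{value}>")
            stack.pop()
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1]['tag']}>")
    return root


if __name__ == "__main__":
    print(parse(tokenize("<p>Hi <b>there</b></p>")))
```

Raising on mismatched tags keeps the sketch short. Real HTML parsers recover from such input instead, which is a large part of what makes them hard to write.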

There is also a more tempting option, Ragel. There you write one big regexp-like structure capable of matching the entire file, and define hooks that fire whenever a certain sub-pattern is matched.
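The "pattern plus hooks" idea can be imitated in plain Python. This is only an analogy, not Ragel syntax or anything Ragel generates: one combined regular expression with named sub-patterns, and a dictionary of callbacks keyed by sub-pattern name. The names `SCANNER`, `scan` and the group names are assumptions made for the example.

```python
import re

# One combined pattern; each named group is a sub-pattern with its own hook.
SCANNER = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<end_tag></[a-zA-Z][a-zA-Z0-9]*\s*>)"
    r"|(?P<start_tag><[a-zA-Z][a-zA-Z0-9]*[^>]*>)"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)


def scan(source, hooks):
    """Run the pattern over the source and call hooks[name](matched_text)."""
    for match in SCANNER.finditer(source):
        hook = hooks.get(match.lastgroup)  # name of the alternative that matched
        if hook is not None:
            hook(match.group())


if __name__ == "__main__":
    scan(
        "<p>Hi<!-- note --></p>",
        {
            "start_tag": lambda text: print("start:", text),
            "end_tag": lambda text: print("end:", text),
            "text": lambda text: print("text:", text),
        },
    )
```

Sub-patterns without a registered hook (the comment in the usage above) are matched and silently skipped, so the caller only writes code for the parts it cares about.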
