Is my idea for creating an HTML parser from scratch going to work?

Question

To practice my skills, I'm going to write an HTML parser. This is the idea I have in mind:

1. Define what I want to tokenize via regex.
2. Accept some HTML as a string.
3. Loop through the HTML string.
4. Save information about each token, such as its content and position, as an object.
5. If a token contains another token, the inner token becomes a child object of the outer (parent) token.
6. Finish building the object graph.
7. Create appropriate getters and setters.
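
As a rough sketch of the steps above (not a real HTML parser): this assumes well-formed input with no attributes, comments, or void elements such as `<br>`. The names `TOKEN_RE`, `Node`, and `parse` are made up for illustration.

```python
import re
from dataclasses import dataclass, field

# Simplified token pattern: a closing tag, an opening tag, or a run of text.
# Assumption: tags have no attributes and there are no comments or void elements.
TOKEN_RE = re.compile(r"</(?P<close>\w+)>|<(?P<open>\w+)>|(?P<text>[^<]+)")


@dataclass
class Node:
    name: str          # tag name, or "#text" for text nodes
    start: int         # offset of the token in the source string
    content: str = ""  # text content (text nodes only)
    children: list = field(default_factory=list)


def parse(html: str) -> Node:
    root = Node("#root", 0)
    stack = [root]  # the innermost open element is the last item
    for m in TOKEN_RE.finditer(html):
        if m.group("open"):
            node = Node(m.group("open"), m.start())
            stack[-1].children.append(node)
            stack.append(node)  # later tokens become children of this node
        elif m.group("close"):
            # Only pop when the closing tag matches the innermost open tag;
            # mismatched closing tags are silently ignored in this sketch.
            if len(stack) > 1 and stack[-1].name == m.group("close"):
                stack.pop()
        else:
            stack[-1].children.append(Node("#text", m.start(), m.group("text")))
    return root


tree = parse("<p>Hi <b>there</b></p>")
```

Here `tree.children[0]` is the `p` node, whose children are a text node and the `b` node. Real-world HTML (attributes, unclosed tags, `<script>` content) would break this pattern quickly.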

Would you say this makes sense?

You should read the famous answer [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — charlietfl, Jan 16 '17 at 05:44
The [description of an HTML parser](https://html.spec.whatwg.org/multipage/syntax.html#parsing) in the HTML specification is character-based and uses state machines, so I would start by looking at that for inspiration. — Blender, Jan 16 '17 at 05:48

score 1 · Answer 1 · edited May 23 '17 at 11:45

Your best bet would be to use a state machine or a tokeniser-based implementation.

You can also read more about parsing HTML5 in the HTML5 specification.
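
To give a feel for the state-machine approach, here is a minimal sketch of a character-based tokeniser. It is far simpler than the tokeniser in the specification: it assumes tags without attributes and does no error recovery. The state names and the `tokenize` function are illustrative, not taken from the spec.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()      # outside any tag, collecting text
    TAG_OPEN = auto()  # just saw "<"
    TAG_NAME = auto()  # reading a tag name until ">"


def tokenize(html):
    """Return (kind, value, position) tuples; kind is 'text', 'start' or 'end'."""
    tokens = []
    state = State.DATA
    buf = []        # characters of the token being built
    start = 0       # offset where the current token began
    is_end = False  # True while reading a closing tag such as </p>
    for i, ch in enumerate(html):
        if state is State.DATA:
            if ch == "<":
                if buf:  # flush any pending text before the tag
                    tokens.append(("text", "".join(buf), start))
                buf, start, is_end = [], i, False
                state = State.TAG_OPEN
            else:
                if not buf:
                    start = i
                buf.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                is_end = True
            else:
                buf.append(ch)
            state = State.TAG_NAME
        elif state is State.TAG_NAME:
            if ch == ">":
                kind = "end" if is_end else "start"
                tokens.append((kind, "".join(buf), start))
                buf = []
                state = State.DATA
            else:
                # Assumption: no attributes, so everything up to ">" is the name.
                buf.append(ch)
    if state is State.DATA and buf:  # trailing text after the last tag
        tokens.append(("text", "".join(buf), start))
    return tokens


tokens = tokenize("<p>Hi</p>")
# tokens == [("start", "p", 0), ("text", "Hi", 3), ("end", "p", 5)]
```

Each character is examined exactly once and the current state decides what it means, so there is no need for regex or `split`. Supporting attributes, comments, or malformed markup means adding states, not rewriting the loop.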

answered Jan 16 '17 at 05:48 by NJH
edited May 23 '17 at 11:45 by Community

How do I tokenize text, though? I can't think of anything other than some simple regex or using split – Asperger Jan 16 '17 at 05:53
I just need to define delimiters like whitespace – Asperger Jan 16 '17 at 05:55
