
To practice my skills, I'm going to write an HTML parser. The idea I have in mind:

  • Define what I want to tokenize via regex.
  • Accept some HTML as a string.
  • Loop through the HTML string.
  • Save information about each token, such as its content and position, as an object.
  • If a token contains another token, the inner token becomes a child object of the outer (parent) token.
  • Finish the object graph.
  • Create appropriate getters and setters.
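
The object-graph part of this plan (token objects with content and position, and child tokens attached to a parent) could be sketched roughly as follows. This is only an illustration under assumptions: the tokens are taken to be already available as `(kind, content, position)` tuples, and the names `Node` and `build_tree` are hypothetical, not from any library.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical token shape: (kind, content, position),
# where kind is "open", "close" or "text".
Token = Tuple[str, str, int]


@dataclass(eq=False)
class Node:
    kind: str                      # "root", "element" or "text"
    content: str                   # tag name, or the text itself
    position: int                  # offset in the source string
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def build_tree(tokens: List[Token]) -> Node:
    """Turn a flat token list into a parent/child graph using a stack."""
    root = Node("root", "", 0)
    stack = [root]                 # stack[-1] is the currently open element
    for kind, content, position in tokens:
        if kind == "open":
            node = Node("element", content, position, parent=stack[-1])
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "text":
            node = Node("text", content, position, parent=stack[-1])
            stack[-1].children.append(node)
        elif kind == "close":
            # Close the nearest open element with this name;
            # a stray closing tag with no match is ignored.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].content == content:
                    del stack[i:]
                    break
    return root


# Tokens for "<div><p>hi</p></div>"
tree = build_tree([
    ("open", "div", 0),
    ("open", "p", 5),
    ("text", "hi", 8),
    ("close", "p", 10),
    ("close", "div", 14),
])
```

The stack is what makes "a token inside another token" work: whatever element is on top of the stack is the parent of the next token. Real HTML has many more rules (void elements, implied end tags), which this sketch does not attempt.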

Would you say this makes sense?

Asperger
  • You should read the famous answer [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – charlietfl Jan 16 '17 at 05:44
  • The [description of an HTML parser](https://html.spec.whatwg.org/multipage/syntax.html#parsing) in the HTML specification is character-based and uses state machines, so I would start by looking at that for inspiration. – Blender Jan 16 '17 at 05:48

1 Answer


Regular expressions aren't a good fit for heavy HTML parsing such as this. HTML elements nest to arbitrary depth, and a regular expression on its own generally cannot keep track of that nesting or of context such as quoted attribute values.

Your best bet would be to use a state machine or a tokeniser-based implementation.
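
To give a feel for what that means, here is a minimal sketch of a character-by-character state-machine tokeniser. It is a heavily simplified illustration, not the algorithm from the HTML specification: the `State` names and the `tokenize` function are hypothetical, attributes are skipped rather than parsed, and comments, doctypes, character references and unterminated tags are not handled properly.

```python
from enum import Enum, auto
from typing import List, Tuple


class State(Enum):
    DATA = auto()      # plain text between tags
    TAG_OPEN = auto()  # just saw "<"
    TAG_NAME = auto()  # reading the tag name
    IN_TAG = auto()    # after the name, skipping attributes
    QUOTED = auto()    # inside a quoted attribute value


Token = Tuple[str, str, int]  # (kind, content, position)


def tokenize(html: str) -> List[Token]:
    tokens: List[Token] = []
    state = State.DATA
    buf: List[str] = []  # characters of the current text run or tag name
    start = 0            # offset where the current token began
    kind = "open"        # "open" or "close" for the tag being read
    quote = ""           # quote character that opened the attribute value

    for i, ch in enumerate(html):
        if state is State.DATA:
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf), start))
                buf, start, kind = [], i, "open"
                state = State.TAG_OPEN
            else:
                if not buf:
                    start = i
                buf.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/" and kind == "open":
                kind = "close"
            elif ch.isalpha():
                buf.append(ch.lower())
                state = State.TAG_NAME
            else:
                # Not a tag after all: keep what we consumed as text.
                buf = list(html[start:i + 1])
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                tokens.append((kind, "".join(buf), start))
                buf, state = [], State.DATA
            elif ch.isspace() or ch == "/":
                state = State.IN_TAG
            else:
                buf.append(ch.lower())
        elif state is State.IN_TAG:
            if ch in "\"'":
                quote, state = ch, State.QUOTED
            elif ch == ">":
                tokens.append((kind, "".join(buf), start))
                buf, state = [], State.DATA
        elif state is State.QUOTED:
            if ch == quote:
                state = State.IN_TAG

    if state is State.DATA and buf:
        tokens.append(("text", "".join(buf), start))
    return tokens
```

For example, `tokenize('<a href="x>y">hi</a>')` returns `[("open", "a", 0), ("text", "hi", 14), ("close", "a", 16)]`: because the machine is in the `QUOTED` state when it sees the `>` inside the attribute value, that character does not end the tag. This kind of context is awkward to express with a single regular expression.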

You can also read more about parsing HTML5 in the HTML5 specification.

Community
NJH