HTML5 is not based on SGML, so what is it based on then?

Question

http://www.w3schools.com/tags/tag_doctype.asp

HTML5 is not based on SGML, and therefore does not require a reference to a DTD.

On what standard is HTML5 based, if not on SGML?

Do not use w3schools as source of information, only for fun. See http://w3fools.com. The answer can be found in any real HTML5 material such as W3C HTML5 CR. — Jukka K. Korpela, Apr 24 '13 at 07:41
possible duplicate of [HTML5 is not based on SGML, and therefore does not require a reference to a DTD](http://stackoverflow.com/questions/16184832/html5-is-not-based-on-sgml-and-therefore-does-not-require-a-reference-to-a-dtd) — Jukka K. Korpela, Apr 24 '13 at 07:43

dtech · Accepted Answer · 2019-10-28T14:30:00.953

20

The HTML5 standard specifies two serializations of HTML5: "html" and "xml". The "xml" serialization is valid XML (which in turn is a subset of SGML). The "html" serialization is no longer based on any other serialization standard; it defines its own complete syntax. Herein lies the difference: HTML4 had an "sgml" serialization and an "xml" serialization (called XHTML 1.0).

Of course HTML5 is for a large part based on HTML4 (based on SGML) and XHTML (based on HTML4 and XML).

Also see the history section of the HTML5 specification

edited Oct 28 '19 at 14:30

answered Apr 24 '13 at 07:34


2

Because HTML5 explicitly allows proprietary tags without declaration, there is no DTD, and HTML5 is not based on SGML but is its own standard. The correct parsing method is currently not defined but this is AFAIK in progress. – cljk Apr 24 '13 at 07:42

score 2 · Answer 2 · answered Jul 19 '22 at 20:40

What is the HTML 5 standard based on?

It is based on what browsers actually do.

In 2002-2005 Ian Hickson went through every browser, and found every parsing edge case for the DOM tree they create when presented with some HTML.

For Example

For example, what should the DOM tree of this (invalid) HTML be:

<!DOCTYPE html><em><p>XY</p></em>

Browsers seemed to agree on the tree:

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
    - P
      - #text: XY

Even though it is invalid HTML, browsers were happy to parse it into what you meant. The last thing your browser should do is refuse to display perfectly understandable HTML.

Now what about this invalid html:

<!DOCTYPE html><em><p>X</em>Y</p>

IE: Y is a child of both p and body. This violates the DOM spec (a node is supposed to have only one parent), but is what the author of the HTML wanted.

Opera: Makes a valid DOM tree, but X isn't emphasised - violating CSS spec.

Mozilla and Safari: make it a valid DOM tree, but Y isn't emphasised (which is what the author wanted)

DOCTYPE: html
HTML
- HEAD
- BODY
  - EM
  - P
    - EM
      - #text: X
    - #text: Y

Which means that different browsers had different ideas on how to handle HTML (hence the need for an HTML standard).
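To see how easily implementations can diverge, here is a toy sketch in Python. It is not what any of those browsers did, and it is not the HTML5 algorithm. `Node`, `NaiveTreeBuilder`, `build` and the `close_strategy` parameter are hypothetical names invented for this illustration; only `html.parser.HTMLParser` comes from the standard library, and it is used purely as an event source. Two equally plausible ways of handling the misnested `</em>` give two different trees:

```python
from html.parser import HTMLParser


class Node:
    """A minimal tree node: an element name or a '#text' leaf."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []

    def dump(self):
        """Render this subtree as nested lists so trees are easy to compare."""
        if self.name == "#text":
            return f"#text: {self.text}"
        return [self.name] + [child.dump() for child in self.children]


class NaiveTreeBuilder(HTMLParser):
    """Toy builder (hypothetical, NOT the HTML5 tree-construction algorithm).

    close_strategy decides what an end tag does when it does not match
    the innermost open element:
      "pop_through": close every open element up to and including the match
      "ignore":      drop the end tag and carry on
    """

    def __init__(self, close_strategy):
        super().__init__()
        self.close_strategy = close_strategy
        self.root = Node("#root")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_data(self, data):
        self.stack[-1].children.append(Node("#text", data))

    def handle_endtag(self, tag):
        open_names = [node.name for node in self.stack[1:]]
        if tag not in open_names:
            return  # stray end tag: nothing to close
        if self.stack[-1].name == tag:
            self.stack.pop()
        elif self.close_strategy == "pop_through":
            while self.stack.pop().name != tag:
                pass
        # with "ignore", a mismatched end tag leaves the stack untouched


def build(markup, close_strategy):
    """Parse markup with the chosen recovery strategy and return the tree."""
    builder = NaiveTreeBuilder(close_strategy)
    builder.feed(markup)
    builder.close()
    return builder.root.dump()


misnested = "<!DOCTYPE html><em><p>X</em>Y</p>"
print(build(misnested, "pop_through"))
# ['#root', ['em', ['p', '#text: X']], '#text: Y']
print(build(misnested, "ignore"))
# ['#root', ['em', ['p', '#text: X', '#text: Y']]]
```

With "pop_through", Y ends up outside both elements; with "ignore", Y is emphasised along with X. Neither toy strategy reproduces the Mozilla and Safari tree shown above, which is exactly the problem: without a shared specification, every recovery rule is a guess.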

A parser can't say:

Well, HTML is supposed to be a subset of SGML. And if your HTML isn't well-formed, then the results are undefined.
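That attitude is easy to demonstrate with a strict parser. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in (it is an XML parser, not an SGML one, and `strict_parse_ok` is a hypothetical helper name). It accepts the well-nested markup and flatly rejects the misnested one, leaving the caller with no document at all:

```python
import xml.etree.ElementTree as ET


def strict_parse_ok(markup):
    """Return True if the strict XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # A well-formedness error is fatal: no tree is produced.
        return False
    return True


print(strict_parse_ok("<em><p>XY</p></em>"))  # True: tags are properly nested
print(strict_parse_ok("<em><p>X</em>Y</p>"))  # False: </em> closes the wrong element
```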

Not good enough

The web needs a standard that reflects how browsers should parse HTML. The W3C wasn't providing one. They hated HTML, and wanted everyone to move to their beautiful XML-ified version of HTML: XHTML (XML itself being a subset of SGML).

The HTML 5 standard is meant to be used in the real world. It has to define how browsers should handle HTML that is not well-formed. It was based on a survey of all existing implementations, choosing either what the consensus was or what the consensus should be.

Which brings us to HTML5

The HTML5 spec lays it out quite plainly:

While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.

In other words (and they also say this):

An HTML5 parser is any parser that follows the parsing rules of HTML5

HTML5 has no grammar. There is no regex, lexer, BNF, or EBNF you can use to parse HTML.

In order to correctly parse HTML to the HTML5 standard, you must implement the (very meticulously detailed) algorithm described in the HTML5 standard.

And if your parser doesn't handle invalid HTML: then that's the fault of your parser.
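To give a feel for what "implement the algorithm" means, here is a heavily simplified sketch of a state-machine tokenizer in Python. The four state names echo states that do exist in the spec's tokenizer (data, tag open, end tag open, tag name), but everything else is an assumption made for illustration: `tokenize` is a hypothetical function, it ignores attributes, comments, doctypes and character references, and it does no tree construction at all. The real algorithm has dozens of states plus a separate tree-construction stage.

```python
def tokenize(markup):
    """Toy tokenizer: text plus attribute-free start/end tags only."""
    tokens, text, name = [], [], []
    state, is_end_tag = "data", False

    def is_ascii_alpha(ch):
        return ch.isascii() and ch.isalpha()

    i = 0
    while i < len(markup):
        ch = markup[i]
        i += 1
        if state == "data":
            if ch == "<":
                state = "tag_open"
            else:
                text.append(ch)
        elif state == "tag_open":
            if ch == "/":
                state = "end_tag_open"
            elif is_ascii_alpha(ch):
                is_end_tag, name = False, [ch.lower()]
                state = "tag_name"
            else:
                # Not a tag after all: keep "<" as text and
                # reconsume the current character in the data state.
                text.append("<")
                state = "data"
                i -= 1
        elif state == "end_tag_open":
            if is_ascii_alpha(ch):
                is_end_tag, name = True, [ch.lower()]
                state = "tag_name"
            else:
                # Simplification: treat "</" as text (the spec does more here).
                text.append("</")
                state = "data"
                i -= 1
        elif state == "tag_name":
            if ch == ">":
                if text:
                    tokens.append(("text", "".join(text)))
                    text.clear()
                kind = "end" if is_end_tag else "start"
                tokens.append((kind, "".join(name)))
                state = "data"
            else:
                name.append(ch.lower())

    # End of input: recover instead of raising an error.
    if state == "tag_open":
        text.append("<")
    elif state == "end_tag_open":
        text.append("</")
    # An unterminated tag name at end of input is simply dropped here.
    if text:
        tokens.append(("text", "".join(text)))
    return tokens


print(tokenize("<em><p>X</em>Y</p>"))
# [('start', 'em'), ('start', 'p'), ('text', 'X'), ('end', 'em'), ('text', 'Y'), ('end', 'p')]
print(tokenize("a < b"))
# [('text', 'a < b')]
```

Note that the tokenizer never fails: every input, however broken, yields some token stream. Deciding what tree the misnested `</em>` produces is the job of the tree-construction stage, which this sketch leaves out entirely.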
