http://www.w3schools.com/tags/tag_doctype.asp
HTML5 is not based on SGML, and therefore does not require a reference to a DTD.
On what standard is HTML 5 based, if not on SGML?
The HTML5 standard specifies two serializations of HTML5: "html" and "xml". The "xml" serialization is valid XML (which in turn is a subset of SGML). The "html" serialization is no longer based on any external serialization standard; it defines its own complete syntax. Herein lies the difference: HTML4 had an "sgml" serialization and an "xml" serialization (called XHTML 1.0).
Of course, HTML5 is in large part based on HTML4 (based on SGML) and XHTML (based on HTML4 and XML).
Also see the history section of the HTML5 specification.
What is the HTML 5 standard based on?
It is based on what browsers actually do.
In 2002-2005 Ian Hickson went through every browser, and found every parsing edge case for the DOM tree they create when presented with some HTML.
For example, what should the DOM tree of this (invalid) HTML be:
<!DOCTYPE html><em><p>XY</p></em>
Browsers seemed to agree on the tree: body contains em, em contains p, and p contains the text XY.
Even though it is invalid HTML, browsers were happy to parse it into what you meant. The last thing your browser should do is refuse to display perfectly understandable HTML.
Now what about this invalid html:
<!DOCTYPE html><em><p>X</em>Y</p>
IE: Y is a child of both p and body. This violates the DOM spec (a node is supposed to have only one parent), but is what the author of the HTML wanted.
Opera: makes a valid DOM tree, but X isn't emphasised, violating the CSS spec.
Mozilla and Safari: make it a valid DOM tree, but Y isn't emphasised (which is what the author wanted).
This means that different browsers had different ideas about how to handle HTML (hence the need for an HTML standard).
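To make the disagreement concrete, here is a minimal sketch of how two different recovery rules turn the same misnested markup into two different trees. It uses Python's standard html.parser only as a tokenizer. The class NaiveTreeBuilder and the two policy names ("ignore_mismatch" and "pop_until_match") are hypothetical and made up for this illustration; they are not what any of the browsers above did, and neither is the algorithm the HTML5 spec defines.

```python
from html.parser import HTMLParser


class Node:
    """A tree node with exactly one parent, as the DOM requires."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class NaiveTreeBuilder(HTMLParser):
    """Hypothetical tree builder with a pluggable rule for stray end tags."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Every start tag opens a child of the current element.
        self.current = Node(tag, self.current)

    def handle_data(self, data):
        Node("#text:" + data, self.current)

    def handle_endtag(self, tag):
        if self.current.name == tag:
            # Well-nested case: close the current element.
            self.current = self.current.parent
        elif self.policy == "pop_until_match":
            # Close every open element up to and including the nearest match.
            node = self.current
            while node is not self.root and node.name != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent
        # Policy "ignore_mismatch": silently drop the stray end tag.


def render(node):
    """Serialize a tree as name(child, child, ...)."""
    if not node.children:
        return node.name
    return node.name + "(" + ", ".join(render(c) for c in node.children) + ")"


def build_tree(markup, policy):
    builder = NaiveTreeBuilder(policy)
    builder.feed(markup)
    builder.close()
    return render(builder.root)


SAMPLE = "<!DOCTYPE html><em><p>X</em>Y</p>"

print(build_tree(SAMPLE, "ignore_mismatch"))  # #root(em(p(#text:X, #text:Y)))
print(build_tree(SAMPLE, "pop_until_match"))  # #root(em(p(#text:X)), #text:Y)
```

Under the first rule Y ends up inside em; under the second it falls outside both em and p. Both trees are valid, both are defensible, and they render differently, which is exactly the kind of divergence a standard has to settle.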
A parser can't say:
Well, HTML is supposed to be a subset of SGML. And if your HTML isn't well-formed, then the results are undefined.
The web needs a standard to reflect how browsers should parse HTML. The W3C wasn't doing it. They hated HTML, and wanted everyone to move to their beautiful SGML version of HTML, an XML-ified version of HTML: XHTML.
The HTML 5 standard is meant to be used in the real world. It has to define how browsers should handle HTML that is not well-formed. It was based on a survey of all existing implementations, choosing either what the consensus was or what the consensus should be.
The HTML5 spec lays it out quite plainly:
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
In other words (and they also say this):
HTML5 has no grammar. There is no regex, lexer, BNF, or EBNF you can use to parse HTML.
In order to correctly parse HTML to the HTML5 standard, you must implement the (very meticulously detailed) algorithm described in the HTML5 standard.
And if your parser doesn't handle invalid HTML, that's the fault of your parser.
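As a rough illustration of why a grammar isn't enough, the sketch below logs the flat stream of events that Python's standard html.parser emits for the misnested example from earlier. Note that html.parser is not an HTML5-conforming parser; it is used here only because it ships with Python. The EventLogger class and the tokenize helper are names made up for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records parser events in source order without building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def tokenize(markup):
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


for event in tokenize("<!DOCTYPE html><em><p>X</em>Y</p>"):
    print(event)
# ('decl', 'DOCTYPE html')
# ('start', 'em')
# ('start', 'p')
# ('text', 'X')
# ('end', 'em')
# ('text', 'Y')
# ('end', 'p')
```

The end tag for em arrives while p is still open, and nothing in the event stream says what to do about it. Deciding where Y goes is the job of the tree construction stage, and that is the part the HTML5 standard spells out step by step instead of leaving it undefined.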