
When I write this HTML document:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>Test</title>
    </head>
    <body>
        <p>
            <div>Example</div>
        </p>
    </body>
</html>

My web browser parses the code into a DOM tree such that the content of the <body> subtree is:

<p></p>
<div>Example</div>
<p></p>

(Tested in Mozilla Firefox 79, Google Chrome 84, and Microsoft Internet Explorer 11.)

Why does this structural change happen? How can I force a <div> to be inside a <p>?



Nayuki
  • Before anyone marks this as a duplicate of other questions, I provide more context in this thread - such as showing the DOM tree and mentioning how SGML fits into the interpretation. – Nayuki Aug 08 '20 at 17:19
  • We mark questions as duplicate, not answers. If you know this has been asked before, which seems likely from the comment above, you should be posting new answers to the duplicate, not asking the question again – Clive Aug 08 '20 at 17:25
  • Related: https://stackoverflow.com/questions/8397852/why-cant-the-p-tag-contain-a-div-tag-inside-it ; https://stackoverflow.com/questions/4291467/nesting-block-level-elements-inside-the-p-tag-right-or-wrong ; https://stackoverflow.com/questions/4967976/what-are-the-allowed-tags-inside-a-li ; https://stackoverflow.com/questions/5997254/where-in-the-world-are-are-the-html-nesting-rules – Nayuki Aug 13 '20 at 20:27

2 Answers


In the beginning, there was the Standard Generalized Markup Language (SGML). SGML defined some aspects of the syntax, like punctuation and tags, but each user application defined other parts, such as tag names, attributes, and nesting.

About a decade later, SGML was simplified to create the XML standard. The way XML is used today for many application-specific data formats is similar to how SGML was used in the past. SGML and XML are essentially meta-languages: each is a syntax template for many application-specific languages.

HTML was initially designed as an application of SGML, hence understanding the history of HTML requires knowledge of some rules of SGML. SGML was intended to be editable in a text editor, so it included many features that reduced the amount of markup to make human writing and reading more convenient. Just a few examples:

  • Some elements like <br> are self-terminating, thus never have a corresponding </br> end tag.
  • Some elements like <tbody> are implicitly inserted, e.g. <table><tr><td></td></tr></table> becomes <table><tbody><tr><td></td></tr></tbody></table>.
  • Some elements like <p> cannot nest in each other, so starting one will terminate the old one: <p><p> becomes <p></p><p></p>.

These element- and tag-level syntax features are enabled or disabled through the SGML declaration and the document type definition (DTD). HTML up to version 4.01 certainly had a DTD, and this was considered the source of truth on how a parser should interpret markup code. The DTD can also tell us things like (not an exhaustive list):

  • What attributes each element is allowed to have.
  • Whether an attribute is optional, required, or has a default value.
  • The distinction between PCDATA and CDATA, which affects how characters are escaped.
  • Exactly what elements are allowed to nest within what.

The DTD is where we can find our answer, at least historically speaking for HTML 4.01 Strict:

<!ELEMENT P - O (%inline;)*            -- paragraph -->

<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

<!ENTITY % fontstyle
 "TT | I | B | BIG | SMALL">

<!ENTITY % phrase "EM | STRONG | DFN | CODE |
                   SAMP | KBD | VAR | CITE | ABBR | ACRONYM" >

<!ENTITY % special
   "A | IMG | OBJECT | BR | SCRIPT | MAP | Q | SUB | SUP | SPAN | BDO">

<!ENTITY % formctrl "INPUT | SELECT | TEXTAREA | LABEL | BUTTON">

The code above says that a <p> element can only contain %inline; content, which is defined as #PCDATA or any member of %fontstyle;, %phrase;, %special;, or %formctrl;. Those four entities expand to a set of 31 elements such as <tt>, <strong>, <img>, and <textarea>. Notice that these so-called inline elements do not include block elements like <div> and <ul>. In other words, <p> cannot contain <div>. The "- O" in the declaration also means that the start tag of <p> is required but its end tag may be omitted.

I don't know the details of how an SGML parser behaves in every situation, but it looks like when one element is not allowed to contain another, the first element is terminated and then the second element begins. This explains why <p><div></div></p> becomes <p></p><div></div><p></p>.

Fast forward to HTML5, which is not based on SGML anymore. Although HTML5 is a bespoke, one-of-a-kind syntax standard, it is intended to be backward-compatible with HTML 4. HTML5 replicates the semantics of correct HTML 4 code, and additionally mandates a uniform way to parse erroneous markup code ("tag soup") so that all browsers behave the same. So the interpretation of <p><div></div></p> is still unchanged from the SGML days.

For <p> in particular, the rule is explained very clearly here:

A p element's end tag can be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, ...

Also, <p> is only allowed to contain "phrasing content" (note the lack of <div>):

Phrasing content is the text of the document, as well as elements that mark up that text at the intra-paragraph level. Runs of phrasing content form paragraphs. a, abbr, area (if it is a descendant of a map element), audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, [...], autonomous custom elements, text
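
To make the behavior concrete, here is a minimal Python sketch of a toy tree builder. It is not the HTML Standard's algorithm: it only models the implied </p> before certain start tags (using a partial tag list) and the handling of a stray </p> as an empty <p></p>, which is how I understand the standard's tree-construction rules. The names ToyTreeBuilder, CLOSES_P, and rebuild are made up for this illustration.

```python
from html.parser import HTMLParser

# Partial list of tags whose start tag implicitly closes an open <p>.
# The real rule lists more tags; this subset is only for illustration.
CLOSES_P = {"address", "article", "aside", "blockquote", "details", "div"}


class ToyTreeBuilder(HTMLParser):
    """Serializes the tree a browser would build, modeling <p> handling only."""

    def __init__(self):
        super().__init__()
        self.out = []        # serialized output pieces
        self.open_tags = []  # stack of currently open elements

    def _close_p(self):
        # Close an open <p>, assumed here to be on top of the stack.
        if self.open_tags and self.open_tags[-1] == "p":
            self.open_tags.pop()
            self.out.append("</p>")

    def handle_starttag(self, tag, attrs):
        # A new <p> or a listed block-level tag ends the current paragraph.
        if tag in CLOSES_P or tag == "p":
            self._close_p()
        self.open_tags.append(tag)
        self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag == "p" and "p" not in self.open_tags:
            # Stray </p>: treated as if an empty <p></p> had been written.
            self.out.append("<p></p>")
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore whitespace-only text so the output stays compact.
        if data.strip():
            self.out.append(data.strip())


def rebuild(markup):
    """Return the toy serialization of the tree built from markup."""
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return "".join(builder.out)
```

With this sketch, rebuild("<p><div>Example</div></p>") returns "<p></p><div>Example</div><p></p>", matching the DOM tree in the question, while <p><span>x</span></p> passes through unchanged because <span> is phrasing content.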

Nayuki
  • *I don't know how the details of how the SGML parser behaves in every situation, but it looks like when one element is not allowed to contain another, the first element is terminated and then the second element begins.* Indeed it does, and not only does an SGML parser close the context div element to accommodate a new div, it also considers other actions allowed by the content model for the parent element, such as closing additional elements, or opening contextually required ones. I've put up details about parsing HTML5 using SGML in [HTML5 DTD Reference](http://sgmljs.net/docs/html5.html). – imhotap Aug 09 '20 at 08:59

Another answer explains why you can't nest <div> inside <p> in HTML code. This answer explains how you can do it by bending the rules.

JavaScript code can manipulate the HTML page's DOM, and you can easily create structures that are legal DOM trees but impossible to express in HTML code (due to the parser's behavior).
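
As a sketch of the idea, the snippet below builds the <div>-inside-<p> nesting purely through DOM calls. It uses Python's xml.dom.minidom as a stand-in for a browser, which is an assumption made so the example runs anywhere; in a page you would use the analogous document.createElement and appendChild calls from JavaScript. The function name build_div_in_p is hypothetical.

```python
from xml.dom.minidom import getDOMImplementation


def build_div_in_p():
    """Build <body><p><div>Example</div></p></body> with DOM calls only."""
    impl = getDOMImplementation()
    # Create an empty document whose root element is <html>.
    doc = impl.createDocument("http://www.w3.org/1999/xhtml", "html", None)

    body = doc.createElement("body")
    doc.documentElement.appendChild(body)

    # No parser is involved here, so nothing closes the <p> early.
    p = doc.createElement("p")
    div = doc.createElement("div")
    div.appendChild(doc.createTextNode("Example"))
    p.appendChild(div)
    body.appendChild(p)

    return p, div, body.toxml()
```

The returned div has p as its parentNode, and the serialized body is <body><p><div>Example</div></p></body>. The tree is legal as a DOM structure even though an HTML parser would never produce it from markup.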

XHTML5 is basically HTML5 expressed in strict XML syntax. As long as the code parses without errors, the DOM tree exactly corresponds to the code. Some consequences:

  • There are no self-terminating elements like <br>; this must be written as either <br/> or <br></br>.
  • No elements will be implicitly inserted, like <tbody> between <table> and <tr>.
  • No elements will be implicitly closed just because HTML doesn't allow them to be nested. In XHTML, <p><p></p></p> is perfectly legal.

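Here is a correct XHTML document that demonstrates <div> in <p>, with no trickery involved (the browser must parse it as XML, for example by serving it as application/xhtml+xml or saving it with an .xhtml extension):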

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="UTF-8"/>
    </head>
    <body>
        <p>
            <div>Example</div>
        </p>
    </body>
</html>
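
To check the claim without a browser, here is a small Python sketch that feeds the same document to a generic XML parser (xml.etree.ElementTree) and looks up the <div> under the <p>. This only shows that XML parsing preserves the nesting; it is an assumption that a browser's XML parser treats this document the same way. The names SOURCE and find_p_and_div are made up for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The XHTML document from the answer, unchanged.
SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="UTF-8"/>
    </head>
    <body>
        <p>
            <div>Example</div>
        </p>
    </body>
</html>"""


def find_p_and_div(source):
    """Parse source as XML and return the <p> and the <div> nested in it."""
    root = ET.fromstring(source)
    ns = {"x": XHTML_NS}
    p = root.find("x:body/x:p", ns)
    # find() returns None if the <div> is not a direct child of the <p>.
    div = p.find("x:div", ns)
    return p, div
```

Calling find_p_and_div(SOURCE) returns a <div> element whose text is "Example", still nested directly inside the <p>, because an XML parser has no HTML-specific rule that would close the paragraph early.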
Nayuki