Strict HTML parser in JavaScript

Question

In HTML, block-level elements can't be children of inline elements. Browsers, however, happily accept this HTML:

<i>foo <h4>bar</h4> fizz</i>

and render it as expected; DOMParser doesn't choke on it either.

But it's not valid, and is therefore hard to convert to another schema. Pandoc parses the above as (option 1):

<i>foo </i><h4>bar</h4> fizz

which is at least valid, but not faithful to the original. Another approach would be (option 2):

<i>foo </i><h4><i>bar</i></h4><i> fizz</i>

Is there a way to force DOMParser to parse more strictly, so that the result is option 1 or 2? (It doesn't seem possible.)

Alternatively, what would be the best approach to deal with this, that is, given the first string, get option 1 or 2 as a result? Is there a JS parser that does this (and otherwise strictly enforces the standard)?
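To make option 2 concrete, here is a minimal sketch of the transformation on an already-parsed tree. It is not a parser and not anything DOMParser or Pandoc provides: the node shape (`{ text }` / `{ tag, children }`), the tag lists, and the names `normalize` and `serialize` are all assumptions made up for illustration. Attributes and text escaping are ignored.

```javascript
// Hypothetical node shape for this sketch (not a real DOM):
//   text node:    { text: "foo " }
//   element node: { tag: "i", children: [ ... ] }

// Partial tag lists for illustration only; the HTML standard defines the real categories.
const BLOCK_TAGS = new Set(["blockquote", "div", "h1", "h2", "h3", "h4", "h5", "h6", "ol", "p", "pre", "ul"]);
const INLINE_TAGS = new Set(["a", "b", "em", "i", "span", "strong", "u"]);

const isElement = (node) => typeof node.tag === "string";
const isBlock = (node) => isElement(node) && BLOCK_TAGS.has(node.tag);
const isInline = (node) => isElement(node) && INLINE_TAGS.has(node.tag);

// Returns the list of nodes that should replace `node` in its parent.
function normalize(node) {
  if (!isElement(node)) return [node];

  // Normalize children first so nested violations are fixed bottom-up.
  const children = node.children.flatMap(normalize);
  if (!isInline(node) || !children.some(isBlock)) {
    return [{ ...node, children }];
  }

  // An inline element contains a block: split the inline element around it.
  const out = [];
  let run = [];
  const flush = () => {
    if (run.length > 0) {
      out.push({ tag: node.tag, children: run });
      run = [];
    }
  };

  for (const child of children) {
    if (!isBlock(child)) {
      run.push(child);
      continue;
    }
    flush();
    if (child.children.length === 0) {
      out.push(child); // nothing to wrap inside an empty block
    } else {
      // Move the inline wrapper inside the block, then re-check the result.
      const wrapped = { tag: node.tag, children: child.children };
      out.push(...normalize({ ...child, children: [wrapped] }));
    }
  }
  flush();
  return out;
}

// Naive serializer: no attributes, no escaping.
function serialize(node) {
  if (!isElement(node)) return node.text;
  return `<${node.tag}>${node.children.map(serialize).join("")}</${node.tag}>`;
}

// Tree for: <i>foo <h4>bar</h4> fizz</i>
const input = {
  tag: "i",
  children: [{ text: "foo " }, { tag: "h4", children: [{ text: "bar" }] }, { text: " fizz" }],
};
const output = normalize(input).map(serialize).join("");
console.log(output); // <i>foo </i><h4><i>bar</i></h4><i> fizz</i>
```

Dropping the `wrapped` step and pushing `child` unchanged would give option 1 instead. A real implementation would also have to carry attributes over to each copy of the inline element and deal with duplicated `id`s.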

Edit: it turns out the HTML parser of at least Chrome (78.0.3904.108) behaves differently when the content is in a p instead of, say, a div. When the HTML above is in a p then it gets parsed as option 2! But it's left as is when inside a div.
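The difference can be reproduced with a short browser snippet. This is only a sketch: `parseInside` is a made-up helper name, and the only APIs assumed are `DOMParser.parseFromString` with the `"text/html"` type and `body.innerHTML`. As far as I understand the HTML standard's tree-construction rules, the `h4` start tag implicitly closes an open `p` and the `i` is then reopened inside what follows, which would explain the split; a `div` is not closed that way.

```javascript
// Parse `inner` wrapped in the given container tag and return the serialized body.
function parseInside(containerTag, inner) {
  const html = `<${containerTag}>${inner}</${containerTag}>`;
  const doc = new DOMParser().parseFromString(html, "text/html");
  return doc.body.innerHTML;
}

// DOMParser is a browser API; skip the demo where it is unavailable (e.g. plain Node.js).
if (typeof DOMParser !== "undefined") {
  const snippet = "<i>foo <h4>bar</h4> fizz</i>";
  console.log("div:", parseInside("div", snippet));
  console.log("p:  ", parseInside("p", snippet));
}
```

Run it in the browser console and compare the two logged strings.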

So I guess the question is now: how can I get the behavior seen inside a p applied inside a div as well?

Does this answer your question? [Strict HTML parsing in JavaScript](https://stackoverflow.com/questions/9353791/strict-html-parsing-in-javascript) — Ouroborus, Dec 11 '19 at 00:14
Thanks but no; that other question is about trying to validate parsed HTML after the fact, and detect errors; I'm trying to find a parser that produces valid strict HTML on the first pass. — Ken, Dec 11 '19 at 11:12
That distinction isn't relevant here. Both questions want to get strict HTML from non-strict HTML via `DOMParser`. The answer is the same: no, there isn't a way to get `DOMParser` to do that; you'll need code or a library to do it. — Ouroborus, Dec 11 '19 at 19:52


0 Answers