First of all, note that something like <div>abc</div> 123 <div>xyz</div> is a valid HTML fragment.
Checking for your requirements and checking if an HTML string is something that would commonly be referred to as “valid” are two very different things.
Your requirements ask for a function that I’m going to call htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements, because what you’re looking for is
- a function that takes a string (an HTML string, presumably) and returns a boolean based on whether the HTML string has certain properties (htmlStringHas…).
- Those properties are:
  - The Nodes, when parsed from the string, consist of either Elements or Text nodes which contain only whitespace. These Nodes are all at the root of the parsed structure (see footnote 1). (…ElementOrWhitespaceRoots…)
  - The Elements are all defined in HTML. (…AndNoUnknownElements)
This is a function that checks for these properties:
const htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements = (string) => {
  // Parse as a full HTML document; the root nodes of the string end up in <body>.
  const parsed = new DOMParser().parseFromString(string, "text/html").body;

  // Every root node must be an Element or a whitespace-only Text node,
  return Array.from(parsed.childNodes)
    .every(({ nodeType, textContent }) => (nodeType === Document.ELEMENT_NODE || nodeType === Document.TEXT_NODE) && (nodeType !== Document.TEXT_NODE || !textContent.trim()))
    // and no element at any depth may be an HTMLUnknownElement.
    && Array.from(parsed.querySelectorAll("*"))
      .every((node) => !(node instanceof HTMLUnknownElement));
};

console.log(htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements("<br><span>test</span> test <div>aa8<x></x><y>asd</y></div>")); // false
every is used to check validity on every Node.
Alternatively, if you want to remove those “invalid” nodes, use filter and call the remove method (either for Elements or for CharacterData nodes, from which Text nodes inherit) on each node using forEach:
Array.from(parsed.childNodes)
  .filter(({ nodeType, textContent }) => (nodeType !== Document.ELEMENT_NODE && nodeType !== Document.TEXT_NODE) || (nodeType === Document.TEXT_NODE && textContent.trim()))
  .concat(Array.from(parsed.querySelectorAll("*"))
    .filter((node) => node instanceof HTMLUnknownElement))
  .forEach((node) => node.remove());
I’ve started by filtering the set of valid nodes, then negated the predicate, and simplified using De Morgan’s laws.
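To make that negation step concrete, here is a minimal sketch that puts the “keep” predicate and the “remove” predicate side by side and checks that they are complementary. The sample objects are hypothetical stand-ins for DOM nodes (only nodeType and textContent are modeled), and the names isValidRoot and isInvalidRoot are mine, not part of any API. The numeric constants are the standard values of Node.ELEMENT_NODE, Node.TEXT_NODE and Node.COMMENT_NODE.

```javascript
// Standard numeric values of the DOM node type constants.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const COMMENT_NODE = 8;

// Predicate for root nodes to keep (the one passed to `every`).
const isValidRoot = ({ nodeType, textContent }) =>
  (nodeType === ELEMENT_NODE || nodeType === TEXT_NODE)
  && (nodeType !== TEXT_NODE || !textContent.trim());

// Its negation after applying De Morgan's laws (the one passed to `filter`).
// Boolean(...) is added here only so both predicates return strict booleans.
const isInvalidRoot = ({ nodeType, textContent }) =>
  (nodeType !== ELEMENT_NODE && nodeType !== TEXT_NODE)
  || (nodeType === TEXT_NODE && Boolean(textContent.trim()));

// Hypothetical stand-ins for real DOM nodes.
const samples = [
  { nodeType: ELEMENT_NODE, textContent: "abc" },
  { nodeType: TEXT_NODE, textContent: "  \n " },
  { nodeType: TEXT_NODE, textContent: " 123 " },
  { nodeType: COMMENT_NODE, textContent: "note" },
];

// The two predicates must disagree on every node.
const complementary = samples.every((node) => isInvalidRoot(node) === !isValidRoot(node));
console.log(complementary); // true
```

Only the non-whitespace text node and the comment node are classified as invalid; the element and the whitespace-only text node are kept.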
Since the function name is unwieldy, let’s abbreviate that to validHTMLString for now, although you must document what you define as “valid”.
Some test cases:
validHTMLString("<div></div><div></div>"); // true
validHTMLString("<x></x>"); // false
validHTMLString("<div><x></x></div>"); // false
validHTMLString("<img/> <span>test</span>"); // true
validHTMLString("a <div>b</div> c"); // false
Please note that there are some major caveats with this:
First, you’ve been asking about “valid” HTML for a while, but usually “valid HTML” means “conforms to the HTML specification”, which can be checked by an HTML validator.
This is non-trivial to check by yourself, since DOMParser will apply exactly the same fixes to broken HTML that your browser will apply for any website it encounters.
Something like validHTMLString("<p><p></p></p><input></input><span>") will therefore result in true, despite containing three errors (or four errors, as the validator counts).
But DOMParser is the best tool we have, other than writing our own validator from scratch or searching for an existing one.
Regular expressions are guaranteed to be insufficient for the purpose of validating arbitrary HTML strings.
You could attempt comparing the result of serializing the parsed result with the original string, but the serialization includes unrelated fixes which don’t cause a validation error. Example: tags like <img/> are serialized as <img> (for instance, by innerHTML).
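As a rough illustration of why the round-trip comparison is unreliable, here is a sketch that parses a string, serializes the resulting body with innerHTML, and compares the result with the input. The name roundTripsUnchanged is hypothetical, and using innerHTML as the serializer is my assumption. The demonstration is guarded because DOMParser only exists in browsers.

```javascript
// Returns true when parsing and re-serializing leaves the string unchanged.
const roundTripsUnchanged = (string) => {
  const body = new DOMParser().parseFromString(string, "text/html").body;
  return body.innerHTML === string;
};

// DOMParser is a browser API, so only run the demonstration where it exists.
if (typeof DOMParser !== "undefined") {
  console.log(roundTripsUnchanged("<div></div>"));    // true
  console.log(roundTripsUnchanged("<p><p></p></p>")); // false: the parser repaired real errors
  console.log(roundTripsUnchanged("<img/>"));         // false, although <img/> is not an error
}
```

The last case is the problem: the comparison cannot tell a repaired error from a harmless normalization.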
Second, custom elements exist.
Something like <my-element> may be a valid element, with its own class derived from HTMLElement, after it has been defined.
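If you want to treat custom elements separately, one option is to collect the element names that look like custom element names and check whether each has been defined. The sketch below is only an approximation under stated assumptions: looksLikeCustomElementName and undefinedCustomElementNames are hypothetical helpers, and the regular expression covers only the ASCII subset of the name grammar in the HTML specification. The registry parameter just needs a get(name) method that returns undefined for unknown names. In a browser you would pass customElements; a Map stands in for it here so the example also runs outside a browser.

```javascript
// Hyphenated names that the HTML specification reserves.
const RESERVED_NAMES = new Set([
  "annotation-xml", "color-profile", "font-face", "font-face-src",
  "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
]);

// ASCII-only approximation of a valid custom element name:
// starts with a lowercase letter, contains at least one hyphen.
const looksLikeCustomElementName = (name) =>
  /^[a-z][a-z0-9._]*-[a-z0-9._-]*$/.test(name) && !RESERVED_NAMES.has(name);

// Returns the distinct custom-element-like names that the registry does not know.
const undefinedCustomElementNames = (localNames, registry) =>
  [...new Set(localNames)]
    .filter((name) => looksLikeCustomElementName(name) && registry.get(name) === undefined);

// Hypothetical registry for demonstration; in a browser, pass `customElements`.
const fakeRegistry = new Map([["my-element", class {}]]);

const missing = undefinedCustomElementNames(
  ["div", "my-element", "other-element", "font-face"],
  fakeRegistry,
);
console.log(missing); // ["other-element"]
```

With a parsed body, the list of names could come from Array.from(parsed.querySelectorAll("*"), (element) => element.localName).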
1: When DOMParser parses HTML, it will try to create a valid HTML document. Your root nodes are the childNodes of the created body.