Check if an HTML string only has element children (or whitespace between elements) and no element is unknown

Question

I am trying to test whether a string contains HTML with some specific properties:

Everything at the top level needs to be wrapped in a tag, so "<div>abc</div><div>xyz</div>" is valid, but "<div>abc</div> 123 <div>xyz</div>" is not. Whitespace between tags is fine.
Every tag needs to be an existing HTML tag, so "<div></div><x></x>" or "<div><x></x></div>" are both invalid since <x></x> is an unknown tag.

console.log(/(&lt;|<)br\s*\/?(&gt;|>)|(<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>)/.test('<br/> span>test<span>'))

// test <br/> -> (&lt;|<)br\s*\/?(&gt;|>)
// test the rest tags -> (<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>)

Also, I tried using DOMParser:

function isValidHTML(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");

  if (doc.documentElement.querySelector("parsererror")) {
    return doc.documentElement.querySelector("parsererror").innerText;
  } else {
    return true;
  }
}

console.log(isValidHTML("<span>test</span> 123 <p>ss</p>"))

Here, I expect an error (or false), because the text 123 is not wrapped in a tag, so by my definition the string is not “valid” HTML. Instead, the function returns true. How can I fix the code?

How exactly do you define “valid”? Regular expressions are going to be the wrong tool to use. Use [`DOMParser`](//developer.mozilla.org/docs/Web/API/DOMParser) and [`XMLSerializer`](//developer.mozilla.org/docs/Web/API/XMLSerializer) instead. — Sebastian Simon, Nov 01 '21 at 16:00
[HTML cannot be parsed with RegEx](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Sean, Nov 01 '21 at 16:02
@SebastianSimon, i need to parse a string, and to check if it is a valid html, so i don't parse html, but string — Asking, Nov 01 '21 at 16:02
@SebastianSimon, i added an example in the question but it does not work, could you help please? — Asking, Nov 01 '21 at 16:36
@skyboyer, i changed, but anyway i get true, but `123` is not a tag, it does not have a tag wrapper. Why i get true? — Asking, Nov 01 '21 at 16:41
then I don't get what the definition of "valid html" means in this case. Text nodes are definitely valid for HTML. — skyboyer, Nov 01 '21 at 16:44
the example does not contain `` closing tag. That's why probably it interprets 123 as in a tag — ilkerkaran, Nov 01 '21 at 16:44
@skyboyer, i want to check if each string is wrapped in html tag. — Asking, Nov 01 '21 at 16:50
"i want to check if each string is wrapped in html tag" - that is a totally different requirement than what I think most of us understood so far... — Peter B, Nov 01 '21 at 16:51
@PeterB, what do you mean? I want to check if the string that contains html is a valid html, meaning all html tags are valid, all text have open and close tag. — Asking, Nov 01 '21 at 16:54

Sebastian Simon · Answer 1 · 2021-11-05T08:03:56.173

First of all, note that something like <div>abc</div> 123 <div>xyz</div> is a valid HTML fragment. Checking for your requirements and checking if an HTML string is something that would commonly be referred to as “valid” are two very different things.

Your requirements ask for a function that I’m going to call htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements, because what you’re looking for is a function that takes a string (an HTML string, presumably) and returns a boolean based on whether the HTML string has certain properties (htmlStringHas…).

Those properties are:
- The Nodes, when parsed from the string, consist of either Elements, or of Text nodes which contain only whitespace. These Nodes are all at the root of the parsed structure.¹ (…ElementOrWhitespaceRoots…)
- The Elements are all defined in HTML. (…AndNoUnknownElements)

This is a function that checks for these properties:

const htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements = (string) => {
  // The root nodes of the string become the child nodes of the parsed body.
  const parsed = new DOMParser().parseFromString(string, "text/html").body;

  // Every root node must be an Element or a whitespace-only Text node,
  // and no element at any depth may be an HTMLUnknownElement.
  return Array.from(parsed.childNodes)
      .every(({ nodeType, textContent }) => (nodeType === Document.ELEMENT_NODE || nodeType === Document.TEXT_NODE) && (nodeType !== Document.TEXT_NODE || !textContent.trim()))
    && Array.from(parsed.querySelectorAll("*"))
      .every((node) => !(node instanceof HTMLUnknownElement));
};

console.log(htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements("<br><span>test</span> test <div>aa8<x></x><y>asd</y></div>")); // false

every is used to check validity on every Node.

Alternatively, if you want to remove those “invalid” nodes, use filter and call the remove method (either for Elements or for CharacterData nodes, which Texts inherit from) on each node using forEach:

Array.from(parsed.childNodes)
  .filter(({ nodeType, textContent }) => (nodeType !== Document.ELEMENT_NODE && nodeType !== Document.TEXT_NODE) || (nodeType === Document.TEXT_NODE && textContent.trim()))
  .concat(Array.from(parsed.querySelectorAll("*"))
    .filter((node) => node instanceof HTMLUnknownElement))
  .forEach((node) => node.remove());

I’ve started by filtering the set of valid nodes, then negated the predicate, and simplified using De Morgan’s laws.
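The negation step can be checked without a browser. The following is only a sketch: the two predicates are copied from the snippets above, while the names isValidRoot and isInvalidRoot, the plain objects standing in for real DOM nodes, and the numeric constants (the values of the corresponding Node constants) are assumptions made for illustration.

```javascript
// Numeric values of Node.ELEMENT_NODE, Node.TEXT_NODE and Node.COMMENT_NODE.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const COMMENT_NODE = 8;

// Predicate passed to every(): the node is an acceptable root.
const isValidRoot = ({ nodeType, textContent }) =>
  (nodeType === ELEMENT_NODE || nodeType === TEXT_NODE) &&
  (nodeType !== TEXT_NODE || !textContent.trim());

// Predicate passed to filter(): the negation, rewritten with De Morgan's laws.
const isInvalidRoot = ({ nodeType, textContent }) =>
  (nodeType !== ELEMENT_NODE && nodeType !== TEXT_NODE) ||
  (nodeType === TEXT_NODE && Boolean(textContent.trim()));

// Plain objects standing in for an element, two text nodes and a comment.
const samples = [
  { nodeType: ELEMENT_NODE, textContent: "abc" },
  { nodeType: TEXT_NODE, textContent: "  \n " },
  { nodeType: TEXT_NODE, textContent: " 123 " },
  { nodeType: COMMENT_NODE, textContent: "note" },
];

// Each sample must be classified as invalid exactly when it is not valid.
const predicatesAgree = samples.every(
  (node) => isInvalidRoot(node) === !isValidRoot(node)
);

console.log(predicatesAgree); // true
```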

Since the function name is unwieldy, let’s abbreviate that to validHTMLString for now, although you must document what you define as “valid”.

Some test cases:

validHTMLString("<div></div><div></div>"); // true
validHTMLString("<x></x>"); // false
validHTMLString("<div><x></x></div>"); // false
validHTMLString("<img/> <span>test</span>"); // true
validHTMLString("a <div>b</div> c"); // false

Please note that there are some major caveats with this:

First, you’ve been asking about “valid” HTML for a while, but usually “valid HTML” means “conforms to the HTML specification”, which can be checked by an HTML validator. This is non-trivial to check by yourself, since DOMParser will apply exactly the same fixes to broken HTML that your browser will apply for any website it encounters. Something like validHTMLString("<p><p></p></p><input></input><span>") will therefore result in true, despite containing three errors (or four errors, as the validator counts). But DOMParser is the best tool we have, other than writing our own validator from scratch or searching for an existing one. Regular expressions are guaranteed to be insufficient for the purpose of validating arbitrary HTML strings.

You could attempt to compare the serialization of the parsed result with the original string, but the serialization includes unrelated fixes for things that don’t cause a validation error. Example: a tag like <img/> does not come back unchanged; XMLSerializer writes it in XML syntax, roughly as <img />, and innerHTML writes it as <img>.
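A minimal sketch of such a round-trip comparison shows why it is unreliable. It uses innerHTML as the serializer rather than XMLSerializer. The helper name survivesRoundTrip and its optional parser parameter are hypothetical; the parameter exists only so that the helper can be tried with a stand-in object outside a browser.

```javascript
// Hypothetical helper: does the string come back unchanged after being
// parsed as HTML and serialized again with innerHTML?
// `parser` defaults to a browser DOMParser; passing one in is an assumption
// made for this sketch, not something the approach above requires.
const survivesRoundTrip = (html, parser = new DOMParser()) =>
  parser.parseFromString(html, "text/html").body.innerHTML === html;

// In a browser:
// survivesRoundTrip("<div>abc</div>");                       // true
// survivesRoundTrip("<p><p></p></p><input></input><span>");  // false, the parser repaired it
// survivesRoundTrip("<br/>");                                // false, serialized as "<br>"
```

The last call is the problem: the trailing slash in <br/> is dropped during serialization, so a harmless difference in syntax is reported the same way as broken markup.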

Second, custom elements exist. Something like <my-element> may be a valid element, with its own class, derived from HTMLElement, after it has been defined.
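To make this caveat concrete: a tag whose name is a valid custom element name (it contains a hyphen, among other rules) is parsed as a plain HTMLElement rather than an HTMLUnknownElement, even before it has been defined, so <my-element> passes the check above. If undefined custom elements should also count as unknown, one possible extra test is to look the name up in the registry. The sketch below rests on assumptions: isUndefinedCustomElement is a hypothetical helper, the hyphen test is a simplification of the real naming rules, and the registry parameter exists only so that a Map can stand in for the browser’s customElements object.

```javascript
// Hypothetical helper (not part of the function above): flags elements whose
// name looks like a custom element name but has no registered definition.
// Assumption: "contains a hyphen" approximates the real naming rules.
const isUndefinedCustomElement = (node, registry = customElements) =>
  node.localName.includes("-") && registry.get(node.localName) === undefined;

// A Map stands in for the browser's CustomElementRegistry here, because both
// expose get(name) and return undefined for names that are not registered.
const standInRegistry = new Map([["my-element", class MyElement {}]]);

console.log(isUndefinedCustomElement({ localName: "my-element" }, standInRegistry)); // false
console.log(isUndefinedCustomElement({ localName: "other-element" }, standInRegistry)); // true
console.log(isUndefinedCustomElement({ localName: "div" }, standInRegistry)); // false
```

In a browser, the registry argument can be omitted and the helper applied to each node returned by parsed.querySelectorAll("*"), next to the HTMLUnknownElement check.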

¹: When DOMParser parses HTML, it will try to create a valid HTML document. Your root nodes are the childNodes of the created body.
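To see which root nodes the parser actually produces for a given string, a small inspection helper can list them. This is only a sketch: describeRoots is a hypothetical name, and the optional parser parameter is an assumption added so that the helper can be tried with a stand-in object outside a browser.

```javascript
// Hypothetical helper: describe each root node as "nodeName: textContent".
// `parser` defaults to a browser DOMParser.
const describeRoots = (html, parser = new DOMParser()) =>
  Array.from(parser.parseFromString(html, "text/html").body.childNodes)
    .map(({ nodeName, textContent }) => `${nodeName}: ${JSON.stringify(textContent)}`);

// In a browser, describeRoots("<div>abc</div> 123 <div>xyz</div>") lists
// three roots: a DIV containing "abc", a #text node containing " 123 ",
// and a DIV containing "xyz". The #text node in the middle is the one that
// makes the string fail the check.
```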
