
I am trying to test whether a string contains HTML with some specific properties:

  • Everything at the top level needs to be wrapped in a tag, so "<div>abc</div><div>xyz</div>" is valid, but "<div>abc</div> 123 <div>xyz</div>" is not. Whitespace between tags is fine.
  • Every tag needs to be an existing HTML tag, so "<div></div><x></x>" or "<div><x></x></div>" are both invalid since <x></x> is an unknown tag.

console.log(/(&lt;|<)br\s*\/?(&gt;|>)|(<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>)/.test('<br/> span>test<span>'))

// test <br/> -> (&lt;|<)br\s*\/?(&gt;|>)
// test the rest tags -> (<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>)

Also, I tried using DOMParser:

function isValidHTML(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, "text/html");

  if (doc.documentElement.querySelector("parsererror")) {
    return doc.documentElement.querySelector("parsererror").innerText;
  } else {
    return true;
  }
}

console.log(isValidHTML("<span>test</span> 123 <p>ss</p>"))

Here, I expect an error, but the function returns true.

I expect a falsy result, because my string is not “valid” HTML according to the rules above. How can I fix the code?

Sebastian Simon
Asking

1 Answer

First of all, note that something like <div>abc</div> 123 <div>xyz</div> is a valid HTML fragment. Checking for your requirements and checking if an HTML string is something that would commonly be referred to as “valid” are two very different things.

Your requirements ask for a function that I’m going to call htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements, because what you’re looking for is

  • a function that takes a string (an HTML string, presumably), and returns a boolean based on if the HTML string has certain properties. (htmlStringHas…)
  • Those properties are:
    • The Nodes, when parsed from the string, consist of either Elements, or of Text nodes which contain only whitespace. These Nodes are all at the root of the parsed structure.[1] (…ElementOrWhitespaceRoots…)
    • The Elements are all defined in HTML. (…AndNoUnknownElements)

This is a function that checks for these properties:

const htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements = (string) => {
  // With "text/html", DOMParser always builds a full document; the root nodes are read from its <body>.
  const parsed = new DOMParser().parseFromString(string, "text/html").body;

  // Every root node must be an Element or a whitespace-only Text node,
  return Array.from(parsed.childNodes)
      .every(({ nodeType, textContent }) => (nodeType === Node.ELEMENT_NODE || nodeType === Node.TEXT_NODE) && (nodeType !== Node.TEXT_NODE || !textContent.trim()))
    // and no element at any depth may be an HTMLUnknownElement.
    && Array.from(parsed.querySelectorAll("*"))
      .every((node) => !(node instanceof HTMLUnknownElement));
};

console.log(htmlStringHasElementOrWhitespaceRootsAndNoUnknownElements("<br><span>test</span> test <div>aa8<x></x><y>asd</y></div>")); // false

every is used to check validity on every Node.

Alternatively, if you want to remove those “invalid” nodes, use filter and call the remove method (either for Elements or for CharacterData nodes, which Texts inherit from) on each node using forEach:

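// `parsed` is the <body> element obtained from DOMParser, as in the function above.
Array.from(parsed.childNodes)
  // Root nodes that are neither Elements nor whitespace-only Text nodes
  .filter(({ nodeType, textContent }) => (nodeType !== Node.ELEMENT_NODE && nodeType !== Node.TEXT_NODE) || (nodeType === Node.TEXT_NODE && textContent.trim()))
  // plus unknown elements at any depth
  .concat(Array.from(parsed.querySelectorAll("*"))
    .filter((node) => node instanceof HTMLUnknownElement))
  .forEach((node) => node.remove());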

I’ve started by filtering the set of valid nodes, then negated the predicate, and simplified using De Morgan’s laws.
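As a quick sanity check of that negation, here is a minimal sketch that compares the two predicates on a few plain objects. The names isValidNode, isInvalidNode and samples are made up for this illustration, and the node type constants are hard-coded with the numeric values of Node.ELEMENT_NODE, Node.TEXT_NODE and Node.COMMENT_NODE so that it also runs outside a browser.

```javascript
// Numeric values of Node.ELEMENT_NODE, Node.TEXT_NODE and Node.COMMENT_NODE,
// hard-coded so that the check does not need a DOM.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;
const COMMENT_NODE = 8;

// Predicate used with `every`: the node is acceptable.
const isValidNode = ({ nodeType, textContent }) =>
  (nodeType === ELEMENT_NODE || nodeType === TEXT_NODE)
    && (nodeType !== TEXT_NODE || !textContent.trim());

// Predicate used with `filter`: the node should be removed.
const isInvalidNode = ({ nodeType, textContent }) =>
  (nodeType !== ELEMENT_NODE && nodeType !== TEXT_NODE)
    || (nodeType === TEXT_NODE && Boolean(textContent.trim()));

// Hypothetical stand-ins for real DOM nodes.
const samples = [
  { nodeType: ELEMENT_NODE, textContent: "abc" },
  { nodeType: TEXT_NODE, textContent: "  \n " },
  { nodeType: TEXT_NODE, textContent: " 123 " },
  { nodeType: COMMENT_NODE, textContent: "note" },
];

// For every sample, exactly one of the two predicates holds.
const predicatesAreComplementary = samples.every(
  (node) => isValidNode(node) === !isInvalidNode(node)
);

console.log(predicatesAreComplementary); // true
```

Only the non-whitespace Text node and the Comment node are selected for removal here.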

Since the function name is unwieldy, let’s abbreviate that to validHTMLString for now, although you must document what you define as “valid”.

Some test cases:

validHTMLString("<div></div><div></div>"); // true
validHTMLString("<x></x>"); // false
validHTMLString("<div><x></x></div>"); // false
validHTMLString("<img/> <span>test</span>"); // true
validHTMLString("a <div>b</div> c"); // false

Please note that there are some major caveats with this:

First, you’ve been asking about “valid” HTML for a while, but usually “valid HTML” means “conforms to the HTML specification”, which can be checked by an HTML validator. This is non-trivial to check by yourself, since DOMParser will apply exactly the same fixes to broken HTML that your browser will apply for any website it encounters. Something like validHTMLString("<p><p></p></p><input></input><span>") will therefore result in true, despite containing three errors (or four errors, as the validator counts). But DOMParser is the best tool we have, other than writing our own validator from scratch or searching for an existing one. Regular expressions are guaranteed to be insufficient for the purpose of validating arbitrary HTML strings.
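To illustrate why a regular expression falls short, here is a small sketch using the paired-tag part of the pattern from the question. One assumption on my part: I’ve changed the backreference so that it points at the captured tag name, since in the original pattern \1 refers to the (&lt;|<) group instead. The variable names are only for this example.

```javascript
// Paired-tag pattern; group 1 captures the tag name and \1 refers back to it.
const pairedTag = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;

// Mis-nested tags: the pattern finds "<div><span></div>" and is satisfied.
const misNested = pairedTag.test("<div><span></div></span>");

// Stray top-level text: an unanchored pattern never looks at " 123 ".
const strayText = pairedTag.test("<div>abc</div> 123 <div>xyz</div>");

// Unknown element: the pattern has no notion of which tag names exist.
const unknownTag = pairedTag.test("<x></x>");

console.log(misNested, strayText, unknownTag); // true true true
```

All three inputs violate the requirements from the question, yet each one passes the pattern.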

You could attempt comparing the result of serializing the parsed result with the original string, but the serialization includes unrelated fixes which don’t cause a validation error. Example: with innerHTML, a tag like <img/> is serialized as <img>.

Second, custom elements exist. Something like <my-element> may be a valid element, with its own class, derived from HTMLElement, after it has been defined.
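If you want to treat such names separately, a rough sketch could look like the following. Note that looksLikeCustomElementName and RESERVED_NAMES are hypothetical names for this example, and the pattern is a simplified, ASCII-only approximation of the naming rules (lowercase start, contains a hyphen, not one of the reserved hyphenated names); the actual specification allows more characters. Whether an element with that name has actually been defined would still need to be checked in the browser, for example with customElements.get.

```javascript
// Hyphenated names that are reserved and cannot be used for custom elements.
const RESERVED_NAMES = new Set([
  "annotation-xml", "color-profile", "font-face", "font-face-src",
  "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
]);

// Simplified check: starts with a lowercase ASCII letter, contains a hyphen,
// uses no uppercase letters, and is not a reserved name.
const looksLikeCustomElementName = (name) =>
  /^[a-z][a-z0-9._]*-[a-z0-9._-]*$/.test(name) && !RESERVED_NAMES.has(name);

console.log(looksLikeCustomElementName("my-element")); // true
console.log(looksLikeCustomElementName("div"));        // false
console.log(looksLikeCustomElementName("font-face"));  // false
```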


1: When DOMParser parses HTML, it will try to create a valid HTML document. Your root nodes are the childNodes of the created body.

Sebastian Simon