Validate input HTML using JavaScript

Question

I need to validate HTML user input in a web app using JavaScript.

What I have done so far, based on this question: I'm using a third-party library, sanitize-html, to sanitize the input and then compare the result to the original. If they differ, the HTML is invalid.

const isValidHtml = (html: string): boolean => {
    let sanitized = sanitizeHtml(html, sanitizationConfig);
    sanitized = sanitized.replace(/\s/g, '').replace(/<br>|<br\/>/g, ''); // different browser's behavior for <br>
    html = html.replace(/\s/g, '').replace(/<br>|<br\/>/g, '');
    return sanitized === html;
}

The above method works fine with unescaped HTML but not with escaped HTML.

isValidHtml('<'); // false
isValidHtml('&lt;'); // true
isValidHtml('<script>'); // false
isValidHtml('&lt;script&gt;'); // true, this should be false also!!!

Am I missing something with this method?
Is there a better way to do this task?

EDIT: As suggested by @brad in the comments, I tried decoding the HTML first:

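// Decode HTML entities by letting a detached <textarea> parse the markup.
function decodeHtml(html: string): string {
    const txt = document.createElement('textarea');
    txt.innerHTML = html;
    const decodedHtml = txt.value; // .value holds the decoded text
    txt.textContent = null; // clear the element's content
    return decodedHtml;
}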

and then called isValidHtml(decodedHtml). I got this result:

isValidHtml('<'); // false
isValidHtml('&lt;'); // false, this should be true!!!
isValidHtml('<script>'); // false
isValidHtml('&lt;script&gt;'); // false

Why not just let the browser parse it, and then re-serialize the DOM to HTML? Whatever you do, RegEx isn't the answer. — Brad, Nov 21 '18 at 03:40
@Brad If I do so, `&lt;` will be decoded as `<` and the `sanitizeHtml` method will return an empty string, which means `isValid('&lt;')` returns false — Mhd, Nov 21 '18 at 03:50
@Brad I updated my question. Is that what you are suggesting? — Mhd, Nov 21 '18 at 13:43

Brad · Answer 1 · 2018-11-21T14:26:25.787

If you're not actually trying to validate the HTML, and are simply trying to ensure it ends up being valid, I would recommend running it through the DOM parser and getting the HTML back out, effectively letting the browser do the work for you.

Untested, but something like this:

const parser = new DOMParser();
const doc = parser.parseFromString(html, 'text/html');
console.log(doc.documentElement.innerHTML);

Basically, you use the browser's built-in parsing to handle any errors, in the standard way that it does anyway. It will create a tree of nodes. From that tree of nodes, you generate HTML that is guaranteed to be valid.
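As a slightly fuller sketch of the same idea, the round trip can be wrapped in a small helper. The names `HtmlParser` and `normalizeHtml` are hypothetical, not part of any library. Passing the parser in as an argument is only an assumption made so the helper can also be exercised outside a browser; in a page you would pass `new DOMParser()`.

```typescript
// Minimal structural type covering the one DOMParser method used here.
// HtmlParser and normalizeHtml are hypothetical names chosen for this sketch.
interface HtmlParser {
    parseFromString(html: string, type: 'text/html'): { body: { innerHTML: string } };
}

// Parse the input as an HTML document, then serialize the <body> contents.
// The parser repairs malformed markup while building the node tree, so the
// returned string is the parser's corrected version of the input.
function normalizeHtml(html: string, parser: HtmlParser): string {
    const doc = parser.parseFromString(html, 'text/html');
    return doc.body.innerHTML;
}

// In a browser: normalizeHtml('<p>unclosed', new DOMParser())
```

In a browser, an input such as `<p>unclosed` should come back as `<p>unclosed</p>`. Reading `doc.body.innerHTML` instead of `doc.documentElement.innerHTML` assumes the input is a fragment; the latter would also include the `<head>` and `<body>` wrappers that the parser adds. Note that this only repairs markup. It does not strip tags such as `<script>`, so it would complement sanitize-html rather than replace it.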

See also: https://developer.mozilla.org/en-US/docs/Web/API/DOMParser#Parsing_an_SVG_or_HTML_document
