38

I need a reliable JavaScript library / function to check if an HTML snippet is valid that I can call from my code. For example, it should check that opened tags and quotation marks are closed, nesting is correct, etc.

I don't want the validation to fail because something is not 100% standard (but would work anyway).

Penny Liu
User

9 Answers

43

Update: this answer is limited - please see the edit below.

Expanding on @kolink's answer, I use:

var checkHTML = function(html) {
  var doc = document.createElement('div');
  doc.innerHTML = html;
  return ( doc.innerHTML === html );
}

I.e., we create a temporary div with the HTML. In order to do this, the browser will create a DOM tree based on the HTML string, which may involve closing tags etc.

Comparing the div's HTML contents with the original HTML will tell us if the browser needed to change anything.

checkHTML('<a>hell<b>o</b>')

Returns false.

checkHTML('<a>hell<b>o</b></a>')

Returns true.

Edit: As @Quentin notes below, this check is excessively strict for a variety of reasons. Browsers add omitted closing tags when they serialise the DOM, even where the closing tag is optional for that element. E.g.:

<p>one para
<p>second para

...is considered valid (since <p> elements are allowed to omit their closing tags) but checkHTML will return false. Browsers will also normalise tag case and alter white space. You should be aware of these limits when deciding to use this approach.

mikemaccana
  • 1
    That doesn't actually work. Take `checkHTML("<p>Test<p>test")` for instance. That is perfectly valid HTML, but the browser will normalize it when it pulls it back out of `innerHTML`.

    – Quentin Feb 06 '15 at 11:19
  • There will be no text outside an element in the DOM generated from that valid HTML. – Quentin Feb 06 '15 at 11:28
  • What invalid HTML? The end tag for `p` elements is optional in HTML. The browser doesn't *correct* it, it *normalizes* it (by including end tags when it generates new HTML from the DOM, not by adding more paragraphs). This makes the approach in this answer useless for testing if the HTML is valid. – Quentin Feb 06 '15 at 11:34
  • 2
    Even if the `<p>` was a `<div>` it would still have been valid HTML. You can have text nodes in a div. That was just an example anyway. It will change which quotes are used around attribute values, lower-case attribute and element names, alter white space, etc. It will generate *lots* of false positives.
    – Quentin Feb 06 '15 at 11:38
  • @Quentin yeah just noticed the case thing too. Have amended the note to say explicitly this is excessively strict. – mikemaccana Feb 06 '15 at 11:43
  • The problem you'll also run into is that browsers can return different results even though the input is not wrong. `woops sad` is valid, but in the latest Safari it will return false because it will actually return the `` rather than without the quotes, whereas IE8 would return the `` stripping the quotes.
    – Nick White Jun 26 '15 at 18:04
  • @trainoasis well it's actually not just IE8 safari has the same type of problem using this scenario. – Nick White Aug 11 '15 at 10:37
  • I can't emphasize false positives from whitespace differences enough, especially between tags. – hlfcoding Dec 15 '15 at 23:27
  • I have worked through the spaces and double quotes. Main problem now is using '&'. Such as,
    as ampersands are escaped
    – Chexpir Jul 06 '16 at 09:14
  • Also doesn't work with `` which `innerHTML` turns into ``, so then the input `html` doesn't match. – Chad Johnson Apr 13 '21 at 17:07
20

Well, this code:

function tidy(html) {
    var d = document.createElement('div');
    d.innerHTML = html;
    return d.innerHTML;
}

This will "correct" malformed HTML to the best of the browser's ability. If that's helpful to you, it's a lot easier than trying to validate HTML.

Niet the Dark Absol
  • Well, I need to make the user enter the input again if it's wrong (copy-paste error). Correcting it myself (using tidy, for example) could produce HTML that is valid but doesn't work properly. – User Apr 07 '12 at 14:34
  • As well as correcting malformed HTML, it also normalises valid HTML. – Quentin Feb 06 '15 at 11:36
13

None of the solutions presented so far is doing a good job in answering the original question, especially when it comes to

I don't want the validation to fail because something is not 100% standard (but would work anyways).

tldr >> check the JSFiddle

So I used the input of the answers and comments on this topic and created a method that does the following:

  • checks the HTML string tag by tag for validity
  • tries to render the HTML string
  • compares the number of tags expected from the string with the number of elements actually rendered in the DOM
  • if 'strict' is set, <br/> and empty-attribute (="") normalizations are not ignored
  • compares the rendered innerHTML with the given HTML string (ignoring whitespace and quotes)

Returns

  • true if rendered html is same as given html string
  • false if one of the checks fails
  • normalized html string if rendered html seems valid but is not equal to given html string

Normalized means that, on rendering, the browser sometimes ignores or repairs specific parts of the input (like adding missing closing tags for <p>) and converts others (like single to double quotes, or encoding ampersands). Distinguishing between "failed" and "normalized" makes it possible to flag the content to the user as "this will not be rendered as you might expect".

Most of the time, the normalized result is only a slightly altered version of the original HTML string; still, sometimes the result is quite different. So this should be used, e.g., to flag user input for further review before saving it to a database or rendering it blindly. (See the JSFiddle for examples of normalization.)
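
Because there are three kinds of result, a plain truthiness check cannot tell `true` from a normalized string. Here is a minimal sketch of how a caller might branch on the result; `classifyValidationResult` and its status names are hypothetical and not part of the JSFiddle:

```javascript
// Hypothetical helper (not from the JSFiddle): classify the three possible
// return values of the validation function described above.
function classifyValidationResult(result) {
  // Exactly `true`: the rendered HTML matched the input.
  if (result === true) {
    return { status: 'unchanged', html: null };
  }
  // Any string (even an empty one): the browser rendered something different.
  if (typeof result === 'string') {
    return { status: 'normalized', html: result };
  }
  // Everything else (`false`): one of the checks failed.
  return { status: 'failed', html: null };
}
```

You could pass it the return value of the validation function and only ask the user for a review when the status is 'normalized'.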

The checks take the following exceptions into consideration

  • ignoring of normalization of single quotes to double quotes
  • image and other tags with a src attribute are 'disarmed' during rendering
  • (if non strict) ignoring of <br/> >> <br> conversion
  • (if non strict) ignoring of normalization of empty attributes (<p disabled> >> <p disabled="">)
  • encoding of initially un-encoded ampersands when reading .innerHTML, e.g. in attribute values


function simpleValidateHtmlStr(htmlStr, strictBoolean) {
  if (typeof htmlStr !== "string")
    return false;

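  // Regex literal: inside a string passed to new RegExp(), "\s" would collapse to a plain "s".
  // A start tag is valid here if its attribute quotes are balanced.
  var validateHtmlTag = /<[a-z]+("[^"]*"|'[^']*'|[^'">])*>/igm,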
    sdom = document.createElement('div'),
    noSrcNoAmpHtmlStr = htmlStr
      .replace(/ src=/igm, " svhs___src=") // disarm src attributes
      .replace(/&amp;/igm, "#svhs#amp##"), // 'save' encoded ampersands
    noSrcNoAmpIgnoreScriptContentHtmlStr = noSrcNoAmpHtmlStr
      .replace(/\r?\n/g, "#svhs#nl##") // temporarily remove line breaks (LF and CRLF)
      .replace(/(<script[^>]*>)(.*?)(<\/script>)/igm, "$1$3") // ignore script contents
      .replace(/#svhs#nl##/igm, "\n"),  // re-add line breaks
    htmlTags = noSrcNoAmpIgnoreScriptContentHtmlStr.match(/<[a-z]+[^>]*>/igm), // get all start-tags
    htmlTagsCount = htmlTags ? htmlTags.length : 0,
    tagsAreValid, resHtmlStr;
        
    
  if(!strictBoolean){
    // ignore <br/> conversions
    noSrcNoAmpHtmlStr = noSrcNoAmpHtmlStr.replace(/<br\s*\/>/ig, "<br>"); // every occurrence, not just the first
  }

  if (htmlTagsCount) {
    tagsAreValid = htmlTags.reduce(function(isValid, tagStr) {
      return isValid && tagStr.match(validateHtmlTag);
    }, true);

    if (!tagsAreValid) {
      return false;
    }
  }


  try {
    sdom.innerHTML = noSrcNoAmpHtmlStr;
  } catch (err) {
    return false;
  }

  // compare rendered tag-count with expected tag-count
  if (sdom.querySelectorAll("*").length !== htmlTagsCount) {
    return false;
  }

  resHtmlStr = sdom.innerHTML.replace(/&amp;/igm, "&"); // undo '&' encoding
  
  if(!strictBoolean){
    // ignore empty attribute normalizations
    resHtmlStr = resHtmlStr.replace(/=""/g, ""); // every occurrence, not just the first
  }

  // compare html strings while ignoring case, quote-changes, trailing spaces
  var
    simpleIn = noSrcNoAmpHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim(),
    simpleOut = resHtmlStr.replace(/["']/igm, "").replace(/\s+/igm, " ").toLowerCase().trim();
  if (simpleIn === simpleOut)
    return true;

  return resHtmlStr.replace(/ svhs___src=/igm, " src=").replace(/#svhs#amp##/g, "&amp;");
}

Here you can find it in a JSFiddle https://jsfiddle.net/abernh/twgj8bev/ , together with different test-cases, including

"<a href='blue.html id='green'>missing attribute quotes</a>" // FAIL
"<a>hell<B>o</B></a>"                                        // PASS
'<a href="test.html">hell<b>o</b></a>'                       // PASS
'<a href=test.html>hell<b>o</b></a>',                        // PASS
"<a href='test.html'>hell<b>o</b></a>",                      // PASS
'<ul><li>hell</li><li>hell</li></ul>',                       // PASS
'<ul><li>hell<li>hell</ul>',                                 // PASS
'<div ng-if="true && valid">ampersands in attributes</div>'  // PASS


foobored
9

9 years later, how about using DOMParser?

It accepts a string as a parameter and returns a Document, just like a parsed HTML page. When parsing fails, the returned document contains a <parsererror> element.

If you parse your HTML as XML, you can at least check that it is XHTML-compliant.

Example

> const parser = new DOMParser();
> const doc = parser.parseFromString('<div>Input: <input /></div>', 'text/xml');
> (doc.documentElement.querySelector('parsererror') || {}).innerText; // undefined

To wrap this as a function

function isValidHTML(html) {
  const parser = new DOMParser();
  const doc = parser.parseFromString(html, 'text/xml');
  if (doc.documentElement.querySelector('parsererror')) {
    return doc.documentElement.querySelector('parsererror').innerText;
  } else {
    return true;
  }
}

Testing the above function

isValidHTML('<a>hell<B>o</B></a>') // true
isValidHTML('<a href="test.html">hell</a>') // true
isValidHTML("<a href='test.html'>hell</a>") // true
isValidHTML("<a href=test.html>hell</a>")  // This page contains the following err..
isValidHTML('<ul><li>a</li><li>b</li></ul>') // true
isValidHTML('<ul><li>a<li>b</ul>') // This page contains the following err..
isValidHTML('<div><input /></div>') // true
isValidHTML('<div><input></div>') // This page contains the following err..

The above works for very simple HTML. However, if your HTML contains code-like text (<script>, <style>, etc.), you need to rewrite it just so that it passes XML validation, even though it is valid HTML.

The following updates code-like html to a valid XML syntax.

export function getHtmlError(html) {
  const parser = new DOMParser();
  const htmlForParser = `<xml>${html}</xml>`
    .replace(/(src|href)=".*?&.*?"/g, '$1="OMITTED"')
    .replace(/<script[\s\S]+?<\/script>/gm, '<script>OMITTED</script>')
    .replace(/<style[\s\S]+?<\/style>/gm, '<style>OMITTED</style>')
    .replace(/<pre[\s\S]+?<\/pre>/gm, '<pre>OMITTED</pre>')
    .replace(/&nbsp;/g, '&#160;');

  const doc = parser.parseFromString(htmlForParser, 'text/xml');
  if (doc.documentElement.querySelector('parsererror')) {
    console.error(htmlForParser.split(/\n/).map( (el, ndx) => `${ndx+1}: ${el}`).join('\n'));
    return doc.documentElement.querySelector('parsererror');
  }
}
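
One portability note, as far as I know: Firefox reports the error by making `<parsererror>` the root element of the returned document, so a search that starts from `doc.documentElement` (which only looks at descendants) can miss it. Here is a hedged sketch that searches from the document instead; `findXmlParseError`, the `<root>` wrapper and the optional `parser` argument are my own assumptions for illustration:

```javascript
// Hypothetical helper: returns the parser's error text, or null if the
// markup parsed cleanly as XML. `parser` is optional and only exists so
// that a stand-in can be injected outside a browser.
function findXmlParseError(markup, parser) {
  // Fall back to the browser's DOMParser when none is supplied.
  var p = parser || new DOMParser();
  // Wrap the snippet so that several top-level elements still form one XML document.
  var doc = p.parseFromString('<root>' + markup + '</root>', 'text/xml');
  // Query the document itself, so a <parsererror> that is the root element is found too.
  var errorNode = doc.querySelector('parsererror');
  return errorNode ? errorNode.textContent : null;
}
```

In a browser you would call it as `findXmlParseError('<div><input></div>')` and treat any non-null result as "not well-formed XML".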
allenhwkim
  • 2
    It seems that isValidHTML() function does not work correctly in Firefox... even though Firefox throws an XML error to console when passed an invalid HTML test such as isValidHTML('test'); the function still falls into the else block and returns TRUE always. I think it still works in Chrome. Is there some other way to check other than "doc.documentElement.querySelector('parsererror')" ? – geogan Apr 05 '22 at 13:32
3
function validHTML(html) {
  var openingTags, closingTags;

  html        = html.replace(/<[^>]*\/\s?>/g, '');      // Remove all self closing tags
  html        = html.replace(/<(br|hr|img).*?>/g, '');  // Remove all <br>, <hr>, and <img> tags
  openingTags = html.match(/<[^\/].*?>/g) || [];        // Get remaining opening tags
  closingTags = html.match(/<\/.+?>/g) || [];           // Get remaining closing tags

  return openingTags.length === closingTags.length ? true : false;
}

// Note: a string without any HTML tags is considered a valid HTML snippet.
// If that is not valid in your case, check the opening tag count first.
var htmlContent = "<p>your html content goes here</p>";

if(validHTML(htmlContent)) {
  alert('Valid HTML')
}
else {
  alert('Invalid HTML');
}
Tarun
0

Using pure JavaScript you may check if an element exists using the following function:

if (typeof(element) != 'undefined' && element != null)

Using the following code we can test this in action:

HTML:

<input type="button" value="Toggle .not-undefined" onclick="toggleNotUndefined()">
<input type="button" value="Check if .not-undefined exists" onclick="checkNotUndefined()">
<p class=".not-undefined"></p>

CSS:

p:after {
    content: "Is 'undefined'";
    color: blue;
}
p.not-undefined:after {
    content: "Is not 'undefined'";
    color: red;
}

JavaScript:

function checkNotUndefined(){
    var phrase = "not ";
    var element = document.querySelector('.not-undefined');
    if (typeof(element) != 'undefined' && element != null) phrase = "";
    alert("Element of class '.not-undefined' does "+phrase+"exist!");
    // $(".thisClass").length checks to see if our elem exists in jQuery
}

function toggleNotUndefined(){
    document.querySelector('p').classList.toggle('not-undefined');
}

It can be found on JSFiddle.

0
function isHTML(str)
{
 var a = document.createElement('div');
 a.innerHTML = str;
 for (var c = a.childNodes, i = c.length; i--; ) // childNodes is lower camel case; the empty third clause keeps the for statement valid
 {
    if (c[i].nodeType == 1) return true;
 }
return false;
}

Good Luck!

David Castro
0

It depends on which JS library you use.

HTML validator for Node.js: https://www.npmjs.com/package/html-validator

HTML validator for jQuery: https://api.jquery.com/jquery.parsehtml/

But, as mentioned before, using the browser to validate broken HTML is a great idea:

function tidy(html) {
    var d = document.createElement('div');
    d.innerHTML = html;
    return d.innerHTML;
}
Eugene Kaurov
0

Expanding on @Tarun's answer from above:

function validHTML(html) { // checks the validity of html, requires all tags and property-names to only use alphabetical characters and numbers (and hyphens, underscore for properties)
    html = html.toLowerCase().replace(/(?<=<[^>]+?=\s*"[^"]*)[<>]/g,"").replace(/(?<=<[^>]+?=\s*'[^']*)[<>]/g,""); // remove all angle brackets from tag properties
    html = html.replace(/<script.*?<\/script>/gs, '');  // Remove all script elements (the s flag lets . span line breaks)
    html = html.replace(/<style.*?<\/style>/gs, '');  // Remove all style elements
    html = html.toLowerCase().replace(/<[^>]*\/\s?>/g, '');      // Remove all self closing tags
    html = html.replace(/<(\!|br|hr|img).*?>/g, '');  // Remove all <br>, <hr>, and <img> tags
    //var tags=[...str.matchAll(/<.*?>/g)]; this would allow for unclosed initial and final tag to pass parsing
    html = html.replace(/^[^<>]+|[^<>]+$|(?<=>)[^<>]+(?=<)/gs,""); // remove all clean text nodes, note that < or > in text nodes will result in artefacts for which we check and return false
    var tags = html.split(/(?<=>)(?=<)/); // declared with var to avoid an implicit global
    if (tags.length%2==1) {
        console.log("uneven number of tags in "+html)
        return false;
    }
    var tagno=0;
    while (tags.length>0) {
        if (tagno==tags.length) {
            console.log("these tags are not closed: "+tags.slice(0,tagno).join());
            return false;
        }
        if (tags[tagno].slice(0,2)=="</") {
            if (tagno==0) {
                console.log("this tag has not been opened: "+tags[0]);
                return false;
            }
            var tagSearch=tags[tagno].match(/<\/\s*([\w\-\_]+)\s*>/);
            if (tagSearch===null) {
                console.log("could not identify closing tag "+tags[tagno]+" after "+tags.slice(0,tagno).join());
                return false;
            } else tags[tagno]=tagSearch[1];
            if (tags[tagno]==tags[tagno-1]) {
                tags.splice(tagno-1,2);
                tagno--;
            } else {
                console.log("tag '"+tags[tagno]+"' trying to close these tags: "+tags.slice(0,tagno).join());
                return false;
            }
        } else {
            tags[tagno]=tags[tagno].replace(/(?<=<\s*[\w_\-]+)(\s+[\w\_\-]+(\s*=\s*(".*?"|'.*?'|[^\s\="'<>`]+))?)*/g,""); // remove all correct properties from tag
            var tagSearch=tags[tagno].match(/<(\s*[\w\-\_]+)/);
            if ((tagSearch===null) || (tags[tagno]!="<"+tagSearch[1]+">")) {
                console.log("fragmented tag with the following remains: "+tags[tagno]);
                return false;
            }
            var tagSearch=tags[tagno].match(/<\s*([\w\-\_]+)/);
            if (tagSearch===null) {
                console.log("could not identify opening tag "+tags[tagno]+" after "+tags.slice(0,tagno).join());
                return false;
            } else tags[tagno]=tagSearch[1];
            tagno++;
        }
    }
    return true;
}

This performs a few additional checks, such as testing whether tags match and whether properties would parse. As it does not depend on an existing DOM, it can be used in a server environment, but beware: it is slow. Also, in theory, tags can be named much more laxly, as you can basically use any Unicode character (with a few exceptions) in tag and property names. This would not pass my own sanity check, however.
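
The core of this approach, pushing opening tags onto a stack and popping them on matching closing tags, can be sketched more compactly and without lookbehind regexes. This is a simplified illustration, not the code above: `tagsAreBalanced` and `VOID_TAGS` are hypothetical names, and script/style contents are not treated specially (an assumption that keeps the sketch short).

```javascript
// Elements that never take a closing tag.
var VOID_TAGS = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                 'input', 'link', 'meta', 'source', 'track', 'wbr'];

// Hypothetical helper: true if every non-void opening tag is closed
// in the right order, false otherwise.
function tagsAreBalanced(html) {
  var stack = [];
  // Matches a comment, or an opening/closing tag whose attribute values
  // may be quoted (so a ">" inside quotes does not end the tag).
  var tagPattern = /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][\w-]*)((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  var match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (match[2] === undefined) continue;          // a comment: skip it
    var name = match[2].toLowerCase();
    var isClosing = match[1] === '/';
    var isSelfClosing = /\/\s*$/.test(match[3]);   // e.g. <br/> or <input />
    if (isClosing) {
      // A closing tag must match the most recently opened tag.
      if (stack.pop() !== name) return false;
    } else if (!isSelfClosing && VOID_TAGS.indexOf(name) === -1) {
      stack.push(name);
    }
  }
  // Anything left on the stack was never closed.
  return stack.length === 0;
}
```

Like the function above, it rejects omitted optional end tags such as `<ul><li>a<li>b</ul>`, so it is stricter than a browser.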

mheim