On Google Chrome (Canary), it seems no string can make the DOM parser fail. I'm trying to parse some HTML, but if the HTML isn't completely, 100%, valid, I want it to display an error. I've tried the obvious:

var newElement = document.createElement('div');
newElement.innerHTML = someMarkup; // Might fail on IE, never on Chrome.

I've also tried the method in this question. It doesn't fail either, even for the most invalid markup I can produce.

So, is there some way to parse HTML "strictly" in Google Chrome at least? I don't want to resort to tokenizing it myself or using an external validation utility. If there's no other alternative, a strict XML parser is fine, but certain elements don't require closing tags in HTML, and preferably those shouldn't fail.

Ry-
  • "strict" in JavaScript has a [specific meaning](http://es5.github.com/#C), so I've edited the title of your question – T.J. Crowder Feb 19 '12 at 22:20
  • *"...certain elements don't require closing tags in HTML..."* Some elements [don't require **opening** tags](http://www.w3.org/TR/html5/syntax.html#optional-tags), either. – T.J. Crowder Feb 19 '12 at 22:21
  • tried it with HTML doctype strict? – powtac Feb 19 '12 at 22:21
  • @powtac: I'm trying to parse HTML fragments - no DTD. – Ry- Feb 19 '12 at 22:23
  • @T.J.Crowder: Okay - but the question remains :) – Ry- Feb 19 '12 at 22:24
  • This might offer some clues (it's an extension, but I imagine it might help if there's source code): http://robertnyman.com/html-validator/ – Jared Farrish Feb 19 '12 at 22:32
  • @JaredFarrish: It delegates to the W3C validation service, I believe. – Ry- Feb 19 '12 at 22:37
  • Yeah, I noticed that (not sure about the inline one). At least initially, JS-Beautify seems to provide a number of errors: https://github.com/einars/js-beautify/blob/master/tests/sanitytest.js and it seems to use an internal tokenizer (although can't tell if it's server-based or if [this file](https://github.com/einars/js-beautify/blob/master/beautify.js) performs it in-browser). I don't want to gum up the comments with any more suggestions though, so hopefully that helps. – Jared Farrish Feb 19 '12 at 22:41

1 Answer

Use the DOMParser to check a document in two steps:

  1. Check whether the document is well-formed XML by parsing it as XML.
  2. Parse the string as HTML. This requires patching DOMParser so that it accepts text/html.
    Then loop through every element and check whether the DOM element is an instance of HTMLUnknownElement. getElementsByTagName('*') works well for this.
    (To parse the document strictly, you would have to walk the tree recursively and track whether each element is allowed at its location, e.g. <area> inside <map>.)
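
The recursive placement check mentioned in the parenthetical could be sketched roughly as follows. This is only an illustration: `ALLOWED_PARENTS` and `findMisplaced` are hypothetical names, the table is deliberately tiny and simplified (the HTML spec's content models are far richer, and `<area>` may in fact sit deeper inside a `<map>`), and the function only assumes nodes that expose `localName` and `children`, as DOM elements do.

```javascript
// Hypothetical, incomplete table: child tag -> parent tags it may appear in.
const ALLOWED_PARENTS = {
  li: ['ul', 'ol', 'menu'],
  area: ['map'],
  dt: ['dl'],
  dd: ['dl'],
};

// Walk the tree depth-first and collect "child in parent" strings for
// every element whose direct parent is not listed in ALLOWED_PARENTS.
function findMisplaced(node, parentName = null, found = []) {
  const name = String(node.localName || '').toLowerCase();
  const allowed = ALLOWED_PARENTS[name];
  if (allowed && !allowed.includes(parentName)) {
    found.push(name + ' in ' + (parentName || '(root)'));
  }
  for (const child of Array.from(node.children || [])) {
    findMisplaced(child, name, found);
  }
  return found;
}
```

In a browser this could be called as `findMisplaced(doc.body)` on a parsed document; an empty array would mean that none of the listed rules were violated, not that the document is valid.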

Demo: http://jsfiddle.net/q66Ep/1/

/* DOM parser for text/html, see https://stackoverflow.com/a/9251106/938089 */
;(function (DOMParser) {
    "use strict";
    var DOMParser_proto = DOMParser.prototype,
        real_parseFromString = DOMParser_proto.parseFromString;

    // Nothing to patch if text/html is already supported natively.
    // Some browsers throw on unsupported types, hence the try/catch.
    try {
        if ((new DOMParser).parseFromString("", "text/html")) return;
    } catch (e) {}

    DOMParser_proto.parseFromString = function (markup, type) {
        if (/^\s*text\/html\s*(;|$)/i.test(type)) {
            var doc = document.implementation.createHTMLDocument(""),
                doc_elt = doc.documentElement,
                first_elt;
            doc_elt.innerHTML = markup;
            first_elt = doc_elt.firstElementChild;
            // If the markup was a full document, promote its <html> element.
            if (doc_elt.childElementCount === 1 && first_elt.localName.toLowerCase() === "html")
                doc.replaceChild(first_elt, doc_elt);
            return doc;
        } else {
            return real_parseFromString.apply(this, arguments);
        }
    };
}(DOMParser));

/*
 * @description              Validate an HTML string
 * @param       String html  The HTML string to be validated
 * @returns            null  If the string is not well-formed XML
 *                    false  If the string contains an unknown element
 *                     true  If the string satisfies both conditions
 */
function validateHTML(html) {
    var parser = new DOMParser()
      , d = parser.parseFromString('<?xml version="1.0"?>'+html,'text/xml')
      , allnodes;
    if (d.querySelector('parsererror')) {
        console.log('Not well-formed HTML (XML)!');
        return null;
    } else {
        /* To use text/html, see https://stackoverflow.com/a/9251106/938089 */
        d = parser.parseFromString(html, 'text/html');
        allnodes = d.getElementsByTagName('*');
        for (var i=allnodes.length-1; i>=0; i--) {
            if (allnodes[i] instanceof HTMLUnknownElement) return false;
        }
    }
    return true; /* The document is syntactically correct, all tags are closed */
}

console.log(validateHTML('<div>'));  //  null, because of the missing close tag
console.log(validateHTML('<x></x>'));// false, because it's not an HTML element
console.log(validateHTML('<a></a>'));//  true, because the tag is closed,
                                     //       and the element is an HTML element
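
Because the function has three outcomes, a plain truthiness check such as `if (!validateHTML(html))` would lump `null` and `false` together. A minimal sketch of a caller that keeps them apart; `describeValidation` is a hypothetical helper name and the messages are placeholders, neither is part of the original answer.

```javascript
// Map the tri-state result of validateHTML to a readable message.
// Strict comparisons are used so that null and false stay distinct.
function describeValidation(result) {
  if (result === null) return 'not well-formed XML';
  if (result === false) return 'contains an unknown element';
  return 'ok';
}
```

It would be used as `describeValidation(validateHTML(someMarkup))`.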

See revision 1 of this answer for an alternative to XML validation without the DOMParser.

Considerations

  • The current method completely ignores the doctype during validation.
  • This method returns null for <input type="text"> even though it is valid HTML5, because the tag is not closed and the XML step therefore fails.
  • Conformance is not checked.
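
One possible way to soften the second limitation is to self-close void elements before the XML step. The following is a rough sketch under stated assumptions, not a tokenizer: `selfCloseVoidElements` is a hypothetical helper, it relies on a regular expression, and it will mis-handle a `>` inside an attribute value as well as tags that appear inside comments or scripts. Valueless attributes such as `disabled` would still fail the XML parse.

```javascript
// Void elements: these never take a closing tag in HTML.
const VOID_ELEMENTS = [
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'param', 'source', 'track', 'wbr',
];

// Matches an opening tag of a void element, with or without a trailing slash.
// The lookahead keeps names like <colgroup> or <input-x> from matching.
const VOID_TAG = new RegExp(
  '<(' + VOID_ELEMENTS.join('|') + ')(?=[\\s/>])([^<>]*?)\\s*/?>',
  'gi'
);

// Rewrite <br>, <br /> and <input type="text"> to their self-closed XML form.
function selfCloseVoidElements(html) {
  return html.replace(VOID_TAG, '<$1$2/>');
}
```

The preprocessed string would only be fed to the XML parse; the HTML parse can keep using the original markup.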
Rob W
  • It should be easier to parse XML with `DOMParser()`/`ActiveXObject('Microsoft.XMLDOM')`. Your construction doesn't validate against a DTD (or XML Schema); it only tries to parse the string as XML (and if that fails, it reports a not-well-formed error). In addition, at least Firefox uses a non-validating parser. – Saxoier Feb 20 '12 at 00:07
  • You should rename your function to `parseXHTML` or something similar. Parsing a SGMLDocument (or HTML5Document) isn't that simple. Your solution will return false for valid HTML strings like `[…]foo` and true for invalid strings like `[…]`. [`document.querySelectorAll` is unbelievably slow compared to `document.getElementsByTagName`](http://jsperf.com/get-all-elements) – Saxoier Feb 20 '12 at 15:05
  • @Saxoier I had already addressed both limitations at the top of the answer. I have also added them at the bottom of the answer, in case you don't see it. As for QSA vs GTA, that's true. – Rob W Feb 20 '12 at 15:18
  • I know you wrote it at the top. But why do you call this function `validateHTML`? The function neither accepts HTML nor can it validate XML (or XHTML). – Saxoier Feb 20 '12 at 15:30
  • @Saxoier This can be used as a base for HTML validation, as pointed out in note 2. However, I am not going to write and test a full validator, because that's time-consuming and wouldn't be anything special. – Rob W Feb 20 '12 at 15:33
  • Sorry, I didn't get notified about this answer for some reason :S Thank you, I've found a different way since, but I'll mark this as the answer. – Ry- Mar 02 '12 at 19:30