You shouldn't use regular-expressions to validate HTML (let alone parse it) because HTML is not a "Regular Language".
Here's an example of a false negative: a valid document that a regular expression written to validate HTML would likely mark as invalid:
<html>
<head>
<!-- </html> -->
</head>
<body>
<p>This is valid HTML</p>
</body>
</html>
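To make the false negative concrete, here is a minimal sketch of the kind of regex someone might write. The pattern and the names `naiveHtmlPattern` and `naiveLooksValid` are hypothetical and made up for illustration, not taken from any library. It insists on a single `<html>...</html>` pair with no other `</html>` in between, so the closing tag inside the comment trips it up:

```javascript
// Hypothetical naive check (an assumption for illustration, not a real library API):
// one <html>...</html> pair, with no other "</html>" allowed before the closing tag.
const naiveHtmlPattern = /^\s*<html>(?:(?!<\/html>)[\s\S])*<\/html>\s*$/;

function naiveLooksValid(html) {
  return naiveHtmlPattern.test(html);
}

// The document from the example above: "</html>" appears inside a comment.
const commentedDoc = [
  "<html>",
  "<head>",
  "<!-- </html> -->",
  "</head>",
  "<body>",
  "<p>This is valid HTML</p>",
  "</body>",
  "</html>",
].join("\n");

// The same document without the comment.
const plainDoc = commentedDoc.replace("<!-- </html> -->\n", "");

console.log(naiveLooksValid(plainDoc));     // true
console.log(naiveLooksValid(commentedDoc)); // false, although the markup is fine
```

The regex has no notion of "inside a comment", so it treats the commented-out `</html>` as the real end of the document.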
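And because an HTML comment runs until the first --> no matter how many <!-- sequences appear inside it (comments don't actually nest, so you can't pair openers with closers), you can't write a straightforward regex for this particular case either: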
<html>
<head>
<!-- <!-- <!-- <!-- </html> -->
</head>
<body>
<p>This is valid HTML</p>
</body>
</html>
And here's a false-positive (assuming you don't use the ^$ regex anchors):
<p>illegal element</p>
<html>
<img>illegal text node</img>
</html>
<p>another illegal element</p>
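As a rough sketch of that false positive, the following uses two hypothetical patterns (`unanchoredPattern` and `anchoredPattern` are names invented here for illustration). The unanchored one accepts the whole broken input, and adding anchors only rejects the stray elements outside `<html>`, not the illegal content inside it:

```javascript
// Hypothetical patterns for illustration only.
const unanchoredPattern = /<html>[\s\S]*<\/html>/;
const anchoredPattern = /^\s*<html>[\s\S]*<\/html>\s*$/;

// The input from the example above.
const brokenDoc = [
  "<p>illegal element</p>",
  "<html>",
  "<img>illegal text node</img>",
  "</html>",
  "<p>another illegal element</p>",
].join("\n");

// Only the middle part: still wrong, because <img> cannot have content.
const brokenInner = [
  "<html>",
  "<img>illegal text node</img>",
  "</html>",
].join("\n");

console.log(unanchoredPattern.test(brokenDoc)); // true: false positive
console.log(anchoredPattern.test(brokenDoc));   // false: anchors catch the stray <p> elements
console.log(anchoredPattern.test(brokenInner)); // true: the bad <img> still slips through
```

Catching the `<img>` case would require the regex to know which elements may have children, which is exactly the knowledge a real parser already has.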
Granted, there are more powerful implementations of regular expressions that add rudimentary support for things like counting depth, but then you're in for a world of hurt.
The correct way to validate HTML is to use an HTML DOM library. In .NET this is HtmlAgilityPack. In browser-based JavaScript it's even simpler: just use the browser's built-in parser (innerHTML):
(stolen from Check if HTML snippet is valid with Javascript)
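function isValidHtml(html) {
    // Create a detached document so the markup is parsed without touching the page.
    var doc = document.implementation.createHTMLDocument("");
    doc.documentElement.innerHTML = html;
    // The parser repairs broken markup, so a mismatch means the input was changed.
    return ( doc.documentElement.innerHTML === html );
}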