There are fancy solutions that use the browser itself to parse the text and then check whether any DOM nodes were constructed, which will be… slow. Or regular expressions, which will be faster, but… potentially inaccurate. There are also two very distinct questions arising from this problem:
Q1: Does a string contain HTML fragments?
Is the string part of an HTML document, containing HTML element markup or encoded entities? This can be used as an indicator that the string may require bleaching / sanitization or entity decoding:
/<\/?[a-z][^>]*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)/i
You can see this pattern in use against all of the examples from all existing answers at the time of this writing, plus some… rather hideous WYSIWYG- or Word-generated sample text and a variety of character entity references.
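As a rough sketch of how the Q1 check might be wrapped up, here is the pattern from this answer written as a JavaScript regex literal: the slash is escaped, the entity group is closed, and the `i` flag is added on my assumption that uppercase tags and hex digits should match too. The names `HTML_FRAGMENT` and `containsHtmlFragment` are made up for illustration.

```javascript
// Hypothetical names. The `i` flag is an assumption: it lets <P> and
// &#X1F600; match as well as their lowercase forms.
const HTML_FRAGMENT = /<\/?[a-z][^>]*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)/i;

// True when the text contains something shaped like an element tag
// or a character entity reference. No `g` flag, so test() is stateless.
function containsHtmlFragment(text) {
  return HTML_FRAGMENT.test(text);
}

console.log(containsHtmlFragment("<b>bold</b>"));      // true
console.log(containsHtmlFragment("Fish &amp; chips")); // true
console.log(containsHtmlFragment("a < b and c > d"));  // false
console.log(containsHtmlFragment("AT&T"));             // false
```

This is still a heuristic: it will flag plain text that merely looks like a tag, such as "<sarcasm>", so treat a match as "may need sanitizing", not as proof of HTML.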
Q2: Is the string an HTML document?
The HTML specification is shockingly loose as to what it considers an HTML document. Browsers go to extreme lengths to parse almost any garbage text as HTML. Two approaches: either just consider everything HTML (since if delivered with a text/html Content-Type, great effort will be expended to try to interpret it as HTML by the user-agent) or look for the prefix marker:
<!DOCTYPE html>
In terms of "well-formedness", that, and almost nothing else, is "required". The following is a 100% complete, fully valid HTML document containing every HTML element you think is being omitted:
<!DOCTYPE html>
<title>Yes, really.</title>
<p>This is everything you need.
Yup. There are explicit rules on how to form "missing" elements such as <html>, <head>, and <body>. Though I find it rather amusing that SO's syntax highlighting failed to detect that properly without an explicit hint.
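For the prefix-marker approach to Q2, a minimal sketch might look like the following. The function name `looksLikeHtmlDocument` is hypothetical. Tolerating leading whitespace, any letter case, and legacy doctypes that continue after `html` are my own assumptions, not something the approach above requires.

```javascript
// Hypothetical helper. True when the text begins with a doctype naming
// "html", allowing leading whitespace and any letter case. The lookahead
// accepts both <!DOCTYPE html> and longer legacy forms (an assumption).
function looksLikeHtmlDocument(text) {
  return /^\s*<!DOCTYPE\s+html(?=[\s>])/i.test(text);
}

const minimal = "<!DOCTYPE html>\n<title>Yes, really.</title>\n<p>This is everything you need.";
console.log(looksLikeHtmlDocument(minimal));        // true
console.log(looksLikeHtmlDocument("<p>Fragment."));  // false
```

A fragment without the marker returns false here even though a browser would happily render it, which is exactly the gap between the two approaches.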