Validating (not sanitizing) that user-supplied HTML only contains a limited subset of tags in Python

Question

I am developing a web application using Python/Flask/SQLAlchemy on the server side.

I am using the wysihtml rich text editor to allow users to enter text with a very limited subset of HTML in it. While wysihtml sanitizes the resulting HTML on the client side, some kind of server-side checking is required to ensure that only that subset of HTML is accepted. To be clear, it is not enough for the input to be valid HTML; it must contain only a very limited set of tags. It also does not have to be a complete HTML document.

I would also like to know when non-compliant HTML is submitted, since that indicates either a bug in the client-side sanitization or a (likely malicious) attempt to bypass it.

I could use Bleach to sanitize the user-supplied HTML, but that does not work as a validator (there is no easy way to tell whether the sanitized HTML has been substantively changed), and the developer has made clear that he regards validation as outside the scope of his tool.

I have looked, but there doesn't appear to be a standard tool for doing validation in these circumstances.

I would prefer not to roll my own if I don't have to for two reasons: first, it will take extra time, and second, I don't want to run the risk of making rookie mistakes.

So can anybody point me to a standard method for doing this server-side in Python? And, if not, why doesn't one exist? Is the thinking behind my need for one misguided, and if so why?

FWIW I just googled "python HTML validation" and got quite a few seemingly useful answers right on the first results page... — bruno desthuilliers, Oct 01 '15 at 09:46
@brunodesthuilliers, maybe I'm missing something, but the methods discussed in the links you've mentioned validate against the entirety of the HTML specification. I don't want to accept *any* valid HTML document - I want to validate that it only contains a very small set of tags. Furthermore, it doesn't have to be a complete HTML document; it's the output from a rich text editor. — rgmerk, Oct 01 '15 at 11:03
Most parsers know how to parse fragments, so that's a non-issue, but _validating_ against a given subset is indeed more specific. Now you could still use any parser and turn it into a validator instead of a sanitizer... — bruno desthuilliers, Oct 01 '15 at 11:52
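
For illustration, the "turn a parser into a validator" idea from the last comment could look roughly like the sketch below. It uses only the standard library's html.parser and reports violations instead of rewriting the input. The whitelist `ALLOWED`, the class `WhitelistValidator`, the helper `find_violations`, and the permitted URL schemes are all hypothetical names and choices made for this example, not part of wysihtml, Bleach, or any standard tool.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> set of attribute names permitted on it.
ALLOWED = {
    "p": set(), "br": set(), "b": set(), "i": set(), "u": set(),
    "ul": set(), "ol": set(), "li": set(),
    "a": {"href"},
}
# Assumed set of acceptable link schemes; adjust to the application.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistValidator(HTMLParser):
    """Collects a list of violations rather than modifying the markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.violations = []

    def _check(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag not in ALLOWED:
            self.violations.append(f"disallowed tag <{tag}>")
            return
        for name, value in attrs:
            if name not in ALLOWED[tag]:
                self.violations.append(f"disallowed attribute {name!r} on <{tag}>")
            elif name == "href" and not (value or "").lower().startswith(SAFE_SCHEMES):
                self.violations.append(f"disallowed URL in href: {value!r}")

    def handle_starttag(self, tag, attrs):
        self._check(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Overridden so self-closing tags such as <br/> are checked only once.
        self._check(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            self.violations.append(f"disallowed tag </{tag}>")

    # Comments, doctypes and processing instructions are rejected outright.
    def handle_comment(self, data):
        self.violations.append("comment not allowed")

    def handle_decl(self, decl):
        self.violations.append("declaration not allowed")

    def handle_pi(self, data):
        self.violations.append("processing instruction not allowed")

    def unknown_decl(self, data):
        self.violations.append("unknown declaration not allowed")


def find_violations(fragment):
    """Return a list of violation messages; an empty list means the fragment passed."""
    validator = WhitelistValidator()
    validator.feed(fragment)
    validator.close()
    return validator.violations
```

A Flask view could call `find_violations(submitted_html)`, reject the request if the list is non-empty, and log the messages for later review. Note that this sketch does not check that tags are properly nested or closed, and html.parser does not necessarily tokenize input the same way a browser does, so it may still be worth running a sanitizer over the accepted HTML as a second layer.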

0 Answers