Why does OWASP whitelist HTML as opposed to a blacklist trust approach?

Question

Updated bad terminology

I'm looking at JSoup and the OWASP Java HTML sanitizer project. I'm only interested in such a tool for the purposes of preventing XSS attacks by sanitizing user input passed to the API layer. The OWASP project says

"Passing 95+% of AntiSamy's unit tests plus many more."

But it doesn't tell me where I can see these tests myself. What do these tests cover? More simply, I want to know why these tools default to a whitelist approach.

I'm sure there is a reason they chose whitelisting over blacklisting. I want to disallow only known XSS-unsafe tags like script and attributes such as on*. With these tools, the blacklist approach does not even seem possible.

I need to know the reasoning behind this, and I suspect it's in the tests. For example, why disallow style tags? Are they dangerous in terms of XSS, or are they disallowed for some other reason? (style can be XSS-unsafe, as mentioned in the comments: XSS attacks and style attributes)

I'm looking for similar XSS justifications for other tags. The unit tests themselves should be enough if somebody knows where to find them. Given enough unsafe tags, this should tell me why a whitelist approach is necessary.

Your use of "blacklist" and "whitelist" is confusing to me. A "blacklist" mechanism would permit everything by default, and require the user to explicitly list things they want to prohibit. A "whitelist" would deny everything by default, and require the user to list things they want to allow. OWASP Java HTML sanitizer uses white listing; everything is denied by default, and you build a policy by allowing specified markup. — erickson, Sep 27 '17 at 20:10
@erickson - Got it, my terminology is wacky. Thanks for the link, that is very helpful. If someone can provide a link to the aforementioned `AntiSamy's unit tests` then this question would be 100% answered. — P.Brian.Mackey, Sep 27 '17 at 20:19

score 3 · Accepted Answer · answered Sep 28 '17 at 12:32

The original AntiSamy tests are in AntiSamyTest (antisamy).

They were adapted for the OWASP Java HTML Sanitizer in AntiSamyTest (owasp).

They contain tests against various HTML fragments, for example:

assertSanitizedDoesNotContain("<TABLE BACKGROUND=\"javascript:alert('XSS')\">", "background");

assertSanitizedDoesNotContain("<META HTTP-EQUIV=\"refresh\" CONTENT=\"0;url=data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4K\">", "<meta");

See the XSS Evasion Cheat Sheet for some more examples.

We tried blacklists, but we kept finding new tags or attributes that could be used to bypass them, and malformed HTML and alternative encodings were used to slip past filters, making blacklists impractical and ineffective. So now the default assumption is that if a tag, attribute, or style isn't explicitly specified as safe, then it's unsafe. This protects not just against the XSS attacks we already know about, but against many new types as well.
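To make the difference concrete, here is a minimal sketch in plain Java. It is a toy model, not how either library works: real sanitizers parse the HTML and also check attributes and URL schemes. The class name, the method names, and both tag sets are hypothetical and chosen only for illustration. It assumes Java 9 or later for `Set.of` and `List.of`.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: it compares the two decision rules on bare tag names.
final class TagPolicySketch {
    // Hypothetical blacklist: the vectors the author happened to know about.
    static final Set<String> DENIED = Set.of("script", "iframe", "object");
    // Hypothetical whitelist: the only markup this application needs.
    static final Set<String> ALLOWED = Set.of("b", "i", "p", "a");

    // Blacklist rule: anything that is not listed is let through.
    static boolean blacklistPermits(String tag) {
        return !DENIED.contains(tag.toLowerCase(Locale.ROOT));
    }

    // Whitelist rule: anything that is not listed is dropped.
    static boolean whitelistPermits(String tag) {
        return ALLOWED.contains(tag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // META is taken from the test cases quoted above; it is not on the blacklist.
        for (String tag : List.of("SCRIPT", "META", "b")) {
            System.out.printf("%-6s blacklist=%b whitelist=%b%n",
                    tag, blacklistPermits(tag), whitelistPermits(tag));
        }
    }
}
```

In this sketch the blacklist lets META through because nobody thought to list it, while the whitelist rejects it without ever having heard of it. The same reasoning applies to attributes such as BACKGROUND on TABLE: a deny list has to anticipate every vector, an allow list only has to enumerate what the application needs.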
