Our legacy application was originally designed to store arbitrary HTML for customizable pages: customers can store whatever HTML they need, and at some point that HTML is rendered.
This approach lets users store XSS payloads. Our current goal is to define and enforce a policy that prevents any XSS from being stored.
We have looked into a few approaches that can sanitize HTML based on predefined rules:
However, both approaches are based on sanitization, not validation. So the basic scenario could look like the following:
- The user types some data into an input.
- The user input is sanitized and compared for equality with the raw (initial, step 1) user input.
- If there is any difference, validation is considered failed.
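The sanitize-and-compare check above can be sketched with Python's standard-library `html.parser` (the allowlist here is hypothetical, not our actual policy):

```python
from html.parser import HTMLParser

# Hypothetical allowlist; a real policy would come from configuration.
ALLOWED_TAGS = {"p", "b", "i", "ul", "ol", "li"}

class AllowlistSanitizer(HTMLParser):
    """Rebuilds the input, keeping only allowlisted tags and dropping all attributes."""
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are always dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # preserve entities so they round-trip

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

def is_valid(html: str) -> bool:
    # Steps 2-3: the input is valid only if sanitization changes nothing.
    return sanitize(html) == html

print(is_valid("<p>hello</p>"))              # True
print(is_valid('<p onclick="x()">hi</p>'))   # False: attribute would be stripped
print(is_valid("<script>alert(1)</script>")) # False: tag not in the allowlist
```

Real sanitizers handle far more edge cases (malformed markup, encodings, URL attributes), but the validate-by-comparison idea is the same.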
This approach works for new data, but for legacy data we would have several problems:
- If a user's legacy data contains forbidden elements, the user won't be able to save even a slightly modified version of the HTML content.
- The following flow will confuse the user:
  - The user edits legacy data that contains tags/content forbidden by the newly defined policy.
  - The user replaces all the content and saves it.
  - The user later decides, for some reason, to revert to the old version.
  - The user is not allowed to save the previous version, as it contains forbidden tags/content.
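The confusing flow above can be reproduced with a toy validator (the regex policy is a hypothetical stand-in for a real sanitize-and-compare check):

```python
import re

def is_valid(html: str) -> bool:
    # Hypothetical policy: only plain text and bare <p> tags are allowed.
    return re.fullmatch(r"(?:<p>|</p>|[^<>])*", html) is not None

# Content saved before the policy existed.
legacy = '<p onclick="alert(1)">old content</p>'

# Steps 1-2: the user replaces the legacy content and saves the new version.
new = "<p>new content</p>"
assert is_valid(new)          # accepted

# Steps 3-4: the user reverts to the old version and is now rejected,
# even though this exact content was previously stored by the system.
assert not is_valid(legacy)
```

The validator cannot distinguish "content the user just typed" from "content the system itself stored earlier", which is the core of the legacy problem.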
So the following questions arise:
- What is the best way to validate user input for malicious HTML elements and XSS vectors?
- Which approach could be used to fix the described problems with legacy data?