Use an HTML Parser
Your requirements are straightforward:
- You must disallow all <script> tags, but keep certain rich HTML tags.
- You must be able to escape inline JavaScript in links, i.e. stringify it or strip the unsafe attributes altogether.
The correct way to handle both of these is to use a modern, standards-compliant HTML parser that analyses the structure of the rich HTML sent over, identifying each tag and exposing the raw values of its attributes. As one of the comments mentions, this is in fact how sanitisation is done.
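To make the idea concrete, here is a minimal sketch of parser-driven allowlist sanitisation using only Python's standard library. The policy (`ALLOWED_TAGS`, `SAFE_SCHEMES`) and the names `AllowlistSanitizer`, `is_safe_url` and `sanitize` are assumptions made up for this illustration, and `html.parser` is not a full HTML5 parser, so treat this as a demonstration of the approach rather than something to deploy in place of a vetted library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: allowed tag -> allowed attributes.
ALLOWED_TAGS = {"p": set(), "b": set(), "i": set(), "em": set(),
                "strong": set(), "a": {"href"}}
# Elements whose text content is dropped together with the tag itself.
DROP_WITH_CONTENT = {"script", "style"}
SAFE_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(url: str) -> bool:
    """Accept relative URLs and absolute URLs with an allowlisted scheme."""
    # Browsers ignore tabs, newlines and control characters when reading
    # the scheme, so remove them before inspecting it.
    cleaned = "".join(ch for ch in url if ch > " ")
    scheme, colon, _ = cleaned.partition(":")
    if not colon or any(c in scheme for c in "/?#"):
        return True  # no scheme present: a relative URL
    return scheme.lower() in SAFE_SCHEMES


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tag: drop the tag, keep any text inside it
        kept = []
        for name, value in attrs:
            # The parser has already decoded character references in value.
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href" and not is_safe_url(value):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the parser decodes attribute values before the policy sees them, an obfuscated link such as `href="java&#115;cript:alert(1)"` is recognised as a `javascript:` URL and its attribute is dropped. This sketch does not rebalance unmatched tags, which a production sanitiser would normally do.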
There are a number of pre-existing HTML parsers that are designed to target XSS-unsafe input. The npm library js-xss, for example, appears to be able to do exactly what you want.
You can even run this server-side as a command line utility.
Similar libraries already exist for most languages, so do a thorough search of your preferred language's package repository. Alternatively, you can launch js-xss as a subprocess from the command line and collect its output directly.
Avoid naively using regular expressions to parse HTML. While it is true that most HTML parsers end up using regular expressions under the hood, they do so in a fairly limited fashion, on strictly well-defined grammars, and only after correctly lexing the input.
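As a small illustration of why the naive approach fails, the sketch below strips `<script>` elements with a single regular expression. The function name `naive_strip_scripts` and the payloads are made up for this example. A nested payload reassembles itself into a working tag once the inner match is removed, and an event-handler attribute on another tag is never touched at all.

```python
import re


def naive_strip_scripts(html: str) -> str:
    # Removes only literal <script>...</script> pairs; it does not understand
    # nesting, attributes, or any other way of running script.
    return re.sub(r"<script>.*?</script>", "", html,
                  flags=re.IGNORECASE | re.DOTALL)


# Removing the inner "<script></script>" joins "<scr" and "ipt>" together.
nested = "<scr<script></script>ipt>alert(1)</script>"
print(naive_strip_scripts(nested))   # <script>alert(1)</script>

# No <script> tag here, so the regex leaves the payload untouched.
handler = "<img src=x onerror=alert(1)>"
print(naive_strip_scripts(handler))  # <img src=x onerror=alert(1)>
```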