Don't ever use regexes to parse HTML. You might be able to ensure it doesn't contain JavaScript, but you can't ensure it won't be horribly broken in other ways. Instead, use a proper parser.
Also, even valid HTML that doesn't contain JavaScript can still contain other unpleasant elements (audio, video, CSS nodes, form elements...). I recommend using a whitelist of the HTML elements and attributes that you do allow.
Here's an example of what your code could look like (note that even though it's supposed to be pseudocode, this might actually be proper C# syntax):
    string[] tagWhitelist = ['strong', 'em', 'span' /*, ...*/];
    string[] attrWhitelist = [/*...*/];

    // Removes anything not on the whitelists, or throws if dieOnError is set.
    DOMNode fixNode(DOMNode node, bool dieOnError){
        if(tagWhitelist.contains(node.type())){
            // allowed tag: check its children, then its attributes
            node.children.each((x) => fixNode(x, dieOnError));
            node.attributes
                .filter((x) => !attrWhitelist.contains(x.name))
                .each((x) => { if(dieOnError) throw new InvalidAttrException(); else x.remove(); });
        }else{
            // disallowed tag: drop the node together with its subtree
            if(dieOnError) throw new InvalidTagException(); else node.remove();
        }
        return node;
    }
    ...
    string output = fixNode(DOMParser.load(input, {strict:false}), false).toString();
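For something you can actually run, here is a minimal sketch of the same whitelist idea in Python, using only the standard library's `html.parser`. It is not a port of the pseudocode above: `html.parser` is a streaming parser rather than a DOM, so the sketch rebuilds the output as it goes instead of removing nodes from a tree. The names `WhitelistSanitizer` and `sanitize`, and the contents of both whitelists, are my own assumptions for illustration. It also doesn't rebalance tags, so badly nested input can still produce unclosed (but whitelisted) tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; adjust to what you actually allow.
TAG_WHITELIST = {"strong", "em", "span"}
ATTR_WHITELIST = {"title"}
# Elements that never have a closing tag, so they cannot open a subtree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []      # pieces of the sanitized output
        self.skipped = []  # open tags inside a subtree that is being removed

    def handle_starttag(self, tag, attrs):
        if self.skipped or tag not in TAG_WHITELIST:
            # Drop the element; remember it so its children are dropped too.
            if tag not in VOID_TAGS:
                self.skipped.append(tag)
            return
        # Keep only whitelisted attributes and re-escape their values.
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or ""))
            for name, value in attrs
            if name in ATTR_WHITELIST
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if self.skipped:
            if tag in self.skipped:
                # Close the removed element and anything left open inside it.
                while self.skipped.pop() != tag:
                    pass
            return
        if tag in TAG_WHITELIST:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is kept (re-escaped) unless it sits inside a removed subtree.
        if not self.skipped:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<strong onclick="x()">hi</strong><script>alert(1)</script>'))
# prints: <strong>hi</strong>
```

Comments are dropped as a side effect, because the sketch never overrides `handle_comment`. If you whitelist URL-carrying attributes such as `href` or `src`, you would also have to check their values (for example against `javascript:` URLs), which this sketch doesn't do.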
This can also be used for validation, but only if the parser is able to throw an exception on invalid HTML (the ones I've worked with always try to fix the code):
    try{
        // note: if fixNode is only ever used to validate, have it return a bool instead of throwing
        fixNode(DOMParser.load(input, {strict:true}), true);
        return true;
    }catch(InvalidTagException, InvalidAttrException ex){
        return false;
    }
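As a rough illustration of the exception-free variant mentioned in the comment above, here is a Python sketch that only validates and returns a bool. Again, `WhitelistValidator`, `is_valid` and the whitelist contents are names I made up for the example. Note that Python's `html.parser` is one of those lenient parsers: it doesn't raise on malformed markup, so this checks only the whitelists, not whether the HTML is well-formed.

```python
from html.parser import HTMLParser

# Hypothetical whitelists for illustration; adjust to what you actually allow.
TAG_WHITELIST = {"strong", "em", "span"}
ATTR_WHITELIST = {"title"}


class WhitelistValidator(HTMLParser):
    def __init__(self):
        super().__init__()
        self.violations = []  # human-readable list of everything not allowed

    def handle_starttag(self, tag, attrs):
        if tag not in TAG_WHITELIST:
            self.violations.append("tag <{}>".format(tag))
        for name, _value in attrs:
            if name not in ATTR_WHITELIST:
                self.violations.append("attribute {} on <{}>".format(name, tag))


def is_valid(html_text):
    # True only if every tag and attribute in the input is whitelisted.
    parser = WhitelistValidator()
    parser.feed(html_text)
    parser.close()
    return not parser.violations
```

Collecting the violations in a list rather than stopping at the first one makes it easy to tell the user what exactly was rejected. Comments and doctype declarations are not flagged by this sketch.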
Update: the code you have linked in the comment claims to do exactly this, but I cannot guarantee it actually does.