I want to remove specific tags from an HTML string. I am using HtmlAgilityPack, but that removes entire nodes, and I want to 'enhance' it to keep the inner HTML. It all works, but I have serious performance issues. Because of that I replaced String.Replace with Regex.Replace, which is already about 4 times faster. The replacement needs to be case-insensitive. This is my current code:
var scrubHtmlTags = new[] {"strong","span","div","b","u","i","p","em","ul","ol","li","br"};
var stringToSearch = "LargeHtmlContent";
foreach (var stringToScrub in scrubHtmlTags)
{
    stringToSearch = Regex.Replace(stringToSearch, "<" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
    stringToSearch = Regex.Replace(stringToSearch, "</" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
}
There is still room for improvement, however:
- It should be possible to get rid of <b> as well as </b> in a single pass, I assume.
- Is it possible to do all string replacements in one run?
Note that you cannot simply strip tags such as these, keep the inner contents, and expect to end up with valid HTML.
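To make the two open questions above concrete, here is a minimal sketch of what a single-pass version might look like. It assumes the same tag list as the code above and, like that code, only matches bare tags such as <b> and </b> (no attributes, no self-closing <br/>). The names TagScrubber, TagPattern and Scrub are hypothetical and chosen only for illustration. An optional slash (/?) covers the opening and closing form together, and an alternation covers all tag names in one pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TagScrubber
{
    private static readonly string[] ScrubHtmlTags =
        { "strong", "span", "div", "b", "u", "i", "p", "em", "ul", "ol", "li", "br" };

    // Built once and reused: matches "<tag>" or "</tag>" for any listed tag.
    // The trailing ">" keeps "b" from matching the start of "<br>".
    private static readonly Regex TagPattern = new Regex(
        "</?(?:" + string.Join("|", ScrubHtmlTags.Select(Regex.Escape)) + ")>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Removes every listed tag in one pass and keeps the text between tags.
    public static string Scrub(string html)
    {
        return TagPattern.Replace(html, "");
    }
}
```

For example, TagScrubber.Scrub("<DIV><b>Hello</b> <i>world</i><br></DIV>") returns "Hello world". Whether this is actually faster on large input would need to be measured; the sketch only shows that both the open/close pair and the whole tag list can be handled by one Regex.Replace call.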