0

I want to remove specific tags from a HTML string. I am using HtmlAgility, but that removes entire nodes. I want to 'enhance' it to keep the innerHtml. It's all working but I have serious performance issues. This made me change the string.replace by a regex.replace and it is already 4 times faster. The replacement needs to be caseinsensitive. This is my current code:

var scrubHtmlTags = new[] {"strong","span","div","b","u","i","p","em","ul","ol","li","br"};
var stringToSearch = "LargeHtmlContent";
foreach (var stringToScrub in scrubHtmlTags)
{
   stringToSearch = Regex.Replace(stringToSearch, "<" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
   stringToSearch = Regex.Replace(stringToSearch, "</" + stringToScrub + ">", "", RegexOptions.IgnoreCase);
}

There are still improvements however:

  1. It should be possible to get rid of < b > as well as < /b > in one run I assume...
  2. Is it possible to do all string replacements in one run?
Peter de Bruijn
  • 792
  • 6
  • 22
  • 3
    For the [2000th time](https://www.google.fr/search?q=site%3Ahttp%3A%2F%2Fstackoverflow.com+"Use+an+HTML+parser"), use an HTML parser. – Thomas Ayoub Jun 23 '16 at 13:22
  • Your code has many worse problems than performance. It doesn't work on any tags with incorrect spacing (as you found), class attributes, style attributes, id attributes, script attributes, etc. http://stackoverflow.com/questions/56107/what-is-the-best-way-to-parse-html-in-c – Dan Bechard Jun 23 '16 at 13:23
  • Try `Regex.Replace(stringToSearch, "?(?:" + string.Join("|", scrubHtmlTags) + ")>", "", RegexOptions.IgnoreCase);` Having this you only need one replace. – Sebastian Schumann Jun 23 '16 at 13:23
  • I use HtmlAgiltiy for some other stuff but HtmlAgility removes the entire node (sometext), i want sometext to remain. I am trying to enhance HtmlAgility... – Peter de Bruijn Jun 23 '16 at 13:24
  • Also, there's some question as to why you're doing this to begin with. You can't arbitrarily remove `
    `, `
      `, etc., keep the inner contents and expect to end up with valid HTML.
    – Dan Bechard Jun 23 '16 at 13:27
  • 1
    You know already [this famous answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? – Tim Schmelter Jun 23 '16 at 13:39
  • Tim, no, but you should also read the answer of Kaitlin on the same page. It describes my case... – Peter de Bruijn Jun 23 '16 at 13:46

3 Answers3

1

To do it in one run you can use this:

stringToSearch = Regex.Replace(stringToSearch, "<\\/?" + string.Format("(?:{0})", string.Join("|", scrubHtmlTags)) + ".*?>", "", RegexOptions.IgnoreCase);

But keep in mind that this may fail on several cases.

Thomas Ayoub
  • 29,063
  • 15
  • 95
  • 142
  • Correction: This _will_ fail in _most_ cases. – Dan Bechard Jun 23 '16 at 13:34
  • @Dan Why will this fail in most cases? OP said that all works as expected except performance. Nobody said that this should result in valid HTML. Please explain why it will fail anyways. – Sebastian Schumann Jun 23 '16 at 13:36
  • @Dan this will remove all HTML tags that are in the provided list if they doesn't contain spaces or attributes – Thomas Ayoub Jun 23 '16 at 13:37
  • @Verarind I'm making the assumption that `"LargeHtmlContent"` contains, well, HTML. OP does not specify that this is any particular subset of HTML. Therefore, OPs assumption that "it's all working" is incorrect. – Dan Bechard Jun 23 '16 at 13:39
  • I allready fixed the issues mentioned above by running it through HtmlAgility. The code of Thomas works and is almost twice as fast a the current code. I have used this mechanism for three years now and never had any problems. It was only the performance that was an issue. – Peter de Bruijn Jun 23 '16 at 13:41
  • The HTML content that is used is already stipped of any unwanted content and tags. – Peter de Bruijn Jun 23 '16 at 13:42
1

If I were your manager ... (koff, koff) ... I would reject your code and tell you, nay, require(!) you, to "listen to Thomas Ayoub," in his #1 post to the first entry on this thread.   You are well on your way to creating completely-unmaintainable code here:   code that was written because it seemed, to someone who wasn’t talking to anyone else, to have “solved” the immediate problem that s/he faced at the time.

Going back to your original task-description, you say that you “wish to remove specific tags from an HTML string.”   You further state that you are already using HtmlAgility (good, good ...), but then you object(!) that it “removes entire nodes.”

“ ’scuse me, but ...” exactly what did you expect it to do?   A “tag,” I surmise, is a (DOM) “node.”

So, faced with what you call “a performance problem,” instead of(!) questing for the inevitable bug(!!) in your previous code, you decided to throw caution to the four winds, and to thereby inflict it upon the project and the rest of the team.

And that, as an old-phart Manager, would be where I would step in.

I would exercise my “authority has its privileges” and instruct you ... order you ... to abandon your present approach and to go back to find-and-fix the bugs in your original approach.   But, going one step further, I would order you first to “find” the bugs, then to present your proposed(!) solution to the Team and to me, before authorizing you (by Team consensus) to implement your proposed fix.

(And I would like to think that, after you spent a suitable amount of time “calling me an a**hole” (of course ...), you would come to understand why I responded in this way, and why I took the time to say as much on Stack-whatever.com.)

Mike Robinson
  • 8,490
  • 5
  • 28
  • 41
  • Did your manager approve to spend this time to write this (alas entertaining and truthful) essay? :D – Dejan Jun 23 '16 at 14:15
-1

You might try this:

foreach (var stringToScrub in scrubHtmlTags)
{
   stringToSearch = Regex.Replace(
       stringToSearch, 
       "<\/?" + stringToScrub + ">", "", 
       RegexOptions.IgnoreCase);
}

But I would try to use one expressions to remove them all.

Jeroen van Langen
  • 21,446
  • 3
  • 42
  • 57
  • Wasn't even finished? wtf – Jeroen van Langen Jun 23 '16 at 13:28
  • Regular expressions *should never be used to parse HTML*. For all of the reasons mentioned in the comments above, and in the answers of hundreds of duplicates of this question on this site. Any RegEx solution is going to be a) grossly incomplete and b) enabling a dangerous and unnecessary habit. – Dan Bechard Jun 23 '16 at 13:30
  • I think it isn't that bad to remove `html like` patterns from a string. A html parser might be too bloated for this task. It's not that the user in question needs to build a complete DOM tree, replace several tags or interpret the document. So the -1 is kinda oversaturated – Jeroen van Langen Jun 23 '16 at 13:37
  • 2
    _"Wasn't even finished"_ So why did you post it? – Tim Schmelter Jun 23 '16 at 13:40
  • @JeroenvanLangen If the OP wants to remove "`html like` patterns", this is not the way to do it. His list is *far* from an exhaustive list of HTML patterns. – Dan Bechard Jun 23 '16 at 13:42
  • tip for the next time... tnx and didn't expect a vote within 2 seconds.. So i edited it.. – Jeroen van Langen Jun 23 '16 at 13:42