20

Possible Duplicate:
How to clean HTML tags using C#

What is the best way to strip HTML tags in C#?

Community
Mr.CSharp
  • 1
    Do you know which tags you want to strip? Or is it all? Even if the html tags change in the future do you still want the code to work? Will the input always be valid XHTML? –  Feb 25 '10 at 15:03
  • 1
    Duplicate: http://stackoverflow.com/questions/787932/using-c-regular-expressions-to-remove-html-tags http://stackoverflow.com/questions/785715/asp-net-strip-html-tags and http://stackoverflow.com/questions/1038431/how-to-clean-html-tags-using-c – George Stocker Feb 25 '10 at 15:19

3 Answers

29
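  // Requires: using System.Text.RegularExpressions;
  public static string StripHTML(string htmlString)
  {
     // Lazily match "<", then any characters (including newlines), up to the first ">".
     string pattern = @"<(.|\n)*?>";

     // Replace every tag-like match with an empty string.
     return Regex.Replace(htmlString, pattern, string.Empty);
  }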
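
A minimal sketch of how this behaves in practice, assuming the same pattern wrapped in a hypothetical `RegexStripDemo` class (the class name is only for illustration):

```csharp
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as in the answer: "<", anything (lazily), then the first ">".
    public static string StripHTML(string htmlString)
    {
        return Regex.Replace(htmlString, @"<(.|\n)*?>", string.Empty);
    }
}
```

For simple markup, `RegexStripDemo.StripHTML("<p>Hello <b>world</b></p>")` returns `Hello world`. The pattern stops at the first `>`, though, so a `>` inside an attribute value breaks it: `RegexStripDemo.StripHTML("<a title=\"a>b\">x</a>")` returns `b">x`. That is probably fine for trusted, simple input and a poor fit for anything user-supplied.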
Ivan G.
  • 6
    my pleasure, at your service, ma'am – Ivan G. Feb 25 '10 at 18:43
  • 5
    Ick, this question is repeated a lot across SO, and this same bad answer is repeated a lot, too. As I already said in another identical post: "You shouldn't use a regular expression to parse a context-free grammar like HTML. If the HTML is being provided by some external entity, then it can be easily manipulated to evade your regular expression." – Mark E. Haase Jul 09 '13 at 18:31
  • we're using htmlagilitypack now – Ivan G. Jul 12 '13 at 10:29
  • 1
    It depends on what you want to achieve. HAP can be extremely slow when you need to strip a few million short strings and quality is not critical. – Ivan G. Feb 13 '15 at 18:28
  • Removing a tag by CSS class name: string cssClassName = "myCSSClass"; string pattern = String.Format("
    ]+class=([""'])[^>]*{0}[^>]*\1[^>]*>(.|\n)*?
    ", cssClassName); Regex.Replace(htmlString, pattern, string.Empty);
    – M. Salah Feb 18 '19 at 07:12
  • Great answer, I have stolen it to post in https://stackoverflow.com/a/67812771/15920836 – StarshipladDev Jun 02 '21 at 21:24
7

Take your HTML string or document and parse it with HTML Agility Pack. This will give you an HtmlDocument object that is very similar to an XmlDocument.

You can then use its methods, such as SelectNodes, to access the portions of the document that you are interested in.
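
As a rough sketch of the parsing approach, assuming the HtmlAgilityPack NuGet package is referenced (the `HapStripper` class and `ToPlainText` method are hypothetical names for illustration), stripping all tags could look like this:

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HapStripper
{
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the document's text nodes, dropping the tags.
        // DeEntitize turns entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling `HapStripper.ToPlainText("<p>Hello <b>world</b> &amp; co</p>")` should give `Hello world & co`. Because the input goes through a real parser, a `>` inside an attribute value does not confuse it the way it confuses a simple regex.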

If you choose to use another approach, be aware that parsing HTML (a non-Regular language) with Regular Expressions is widely regarded as a bad idea.

Regardless of the approach, if you are keeping some markup, use a whitelist: remove everything that is not explicitly allowed.
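
One simple way to approximate a whitelist without a parser is to encode everything first and then restore only a short list of exact, attribute-free tags. This is a swapped-in technique, not the approach described above: disallowed tags are neutralized (shown as text) rather than removed. The `TagWhitelist` class and the particular allowed tags are assumptions chosen for illustration.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TagWhitelist
{
    // Hypothetical allow-list: only these tags, and only without attributes.
    private static readonly Regex AllowedEncodedTag = new Regex(
        @"&lt;(/?)(b|i|em|strong|p)&gt;", RegexOptions.IgnoreCase);

    public static string Sanitize(string input)
    {
        // Encode everything so that no markup survives by default.
        string encoded = WebUtility.HtmlEncode(input);

        // Restore only the exact tags on the allow-list.
        return AllowedEncodedTag.Replace(encoded, "<$1$2>");
    }
}
```

With this sketch, `TagWhitelist.Sanitize("<b>hi</b><script>x</script>")` returns `<b>hi</b>&lt;script&gt;x&lt;/script&gt;`. Tags carrying attributes (for example `<b onclick="...">`) never match the allow-list pattern, so they stay encoded.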

Lachlan Roche
0

To guarantee that no HTML tags get through, use HttpServerUtility.HtmlEncode(string), which encodes the markup rather than stripping it.
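
To illustrate what encoding does, here is a small sketch that uses `System.Net.WebUtility.HtmlEncode` as a stand-in, on the assumption that you may not have a `System.Web` reference for `HttpServerUtility` (the `EncodeDemo` wrapper is a hypothetical name):

```csharp
using System.Net;

public static class EncodeDemo
{
    // Escapes characters such as < > & and quotes so the browser shows them as text.
    public static string Encode(string input)
    {
        return WebUtility.HtmlEncode(input);
    }
}
```

`EncodeDemo.Encode("The image tag: <img>")` returns `The image tag: &lt;img&gt;`. The tag is still there, but it is rendered as literal text instead of being interpreted as markup.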

If you want some to get through, you can use this "Whitelist" approach.

Update: Some vulnerabilities have been found in that code, as a developer from Fog Creek points out.

(Second link includes code).

George Stocker
  • 11
    HTMLEncode("The image tag: <img>") outputs "The image tag: &lt;img&gt;", which is not the same as stripping it. – Filip Ekberg Feb 25 '10 at 14:58
  • It all depends on the result he wants. If he wants to make sure that no HTML tags are ever executed (which would open him up to XSS), then the first way is the 'best' way. If he just wants to have plaintext come through, a variation of the second way is 'best'. – George Stocker Feb 25 '10 at 15:00
  • He might want to remove tags to display it as clear text in an RSS feed or something. In PHP you have a built-in function called http://php.net/strip_tags which, by the sound of it, is what he wants. But the whitelist solves that; you could also use that HTML Pack or whatever it is called. – Filip Ekberg Feb 25 '10 at 15:03
  • Actually, this approach is FAR more secure than the regex suggested above. The only drawback to this approach is that users may not want to see encoded HTML. – Mark E. Haase Jul 09 '13 at 18:32
  • 1
    Links in answers are a bad idea because they sometimes break! – muttley91 Sep 22 '14 at 15:35
  • While interesting, it answers a different question, as it does not strip HTML tags. – Mickael Bergeron Néron Jun 13 '22 at 02:07