200

Is there any easy way to remove all HTML tags or ANYTHING HTML related from a string?

For example:

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)"

The above should really be:

"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"

JJ.
  • This question was closed as a duplicate, but the suggested answer uses the Html Agility Pack. If you want to remove HTML tags without the Html Agility Pack, see my answer here: http://stackoverflow.com/a/30026043/2318354. It may be helpful to someone. – Dilip Langhanoja May 05 '15 at 10:55
  • 9
    This is not a duplicate, as "HTML agility pack - removing unwanted tags without removing content?" wants to keep some tags (ie, give a list of valid tags, remove the rest). This question here is about removing ALL tags. And I can't use the other question's answers as I'm not going to pass in a list of all html tags in existence. – Thierry_S Jan 18 '17 at 19:23
  • Take a look at [xidel](https://sourceforge.net/projects/xidel/). It will take you 95% of the way there with `xidel -s input -e '/'`. – vhs Apr 24 '20 at 19:19

5 Answers

386

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

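Be aware that this solution has flaws: a regex does not parse HTML, so a `>` inside an attribute value ends the match early, and because `.` does not match a newline by default, a tag split across lines is left in place. See Remove HTML tags in String for more information (especially the comments of 'Mark E. Haase'/@mehaase).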

Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?
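The two regex pitfalls can be reproduced with a minimal sketch like the one below. The class name `StripHtmlDemo` and the optional `options` parameter are assumptions added for illustration; only the pattern comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Same pattern as the answer. The optional RegexOptions parameter is an
    // illustrative addition: RegexOptions.Singleline lets '.' match '\n' too.
    public static string StripHTML(string input, RegexOptions options = RegexOptions.None)
    {
        return Regex.Replace(input, "<.*?>", String.Empty, options);
    }
}
```

With this sketch, `StripHTML("<a title=\"a > b\">link</a>")` returns ` b">link`, because the match stops at the `>` inside the attribute value. `StripHTML("<b\nclass=\"x\">bold</b>")` leaves the opening tag in place, while passing `RegexOptions.Singleline` for the same input returns `bold`.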

Michael Freidgeim
Bidou
84

You can parse the string using Html Agility pack and get the InnerText.

    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml("<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)");
    string result = htmlDoc.DocumentNode.InnerText;
ssilas777
  • 2
    I like the `InnerText` solution as it removes all tags. But... it leaves behind `&nbsp;` and also comment tags such as `<!-- -->` like those surrounding `v:shapetype`, `v:shape` or `v:imagedata` with `[if gte vml 1]` or `[if !vml]` – Thierry_S Jan 18 '17 at 19:26
  • 14
    I realize that `&nbsp;` is an HTML entity, not a tag, so a solution to remove that could be `result = WebUtility.HtmlDecode(result);` and to remove the comment nodes, using the Html Agility Pack: `htmlDoc.DocumentNode.SelectNodes("//comment()")?.ForEach(c=> c.Remove());` just before doing `result = htmlDoc.DocumentNode.InnerText;` – Thierry_S Jan 18 '17 at 20:15
7

You can use the code below on your string to get the text without the HTML parts.

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)".Replace("&nbsp;", string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);
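Replacing only `&nbsp;` leaves other entities untouched and can leave runs of spaces behind. As a hedged variation, the sketch below decodes every entity with `WebUtility.HtmlDecode` and then collapses whitespace. The names `HtmlText` and `ToPlainText` are hypothetical, and the tag removal still relies on the same regex with its known limits.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strip tags, decode entities, normalize whitespace.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Remove anything that looks like a tag (Singleline: tags may span lines).
        string noTags = Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);

        // Decode entities such as &nbsp; and &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // &nbsp; decodes to U+00A0; turn it into a plain space explicitly.
        decoded = decoded.Replace('\u00A0', ' ');

        // Collapse whitespace runs to one space and trim both ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For the `title` string from the question this returns `Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series, )`. The leftover `, ` before the closing parenthesis is ordinary text rather than HTML, so removing it would need cleanup specific to this data.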
Vinay
1

I built a small function to remove HTML tags.

// Requires System.Collections.Generic, System.Linq, System.Text and System.Text.RegularExpressions.
public static string RemoveHtmlTags(string text)
{
    // Positions of every '<' and '>' in the input.
    List<int> openTagIndexes = Regex.Matches(text, "<").Cast<Match>().Select(m => m.Index).ToList();
    List<int> closeTagIndexes = Regex.Matches(text, ">").Cast<Match>().Select(m => m.Index).ToList();
    if (closeTagIndexes.Count > 0)
    {
        StringBuilder sb = new StringBuilder();
        int previousIndex = 0;
        foreach (int closeTagIndex in closeTagIndexes)
        {
            // '<' characters between the previous '>' and this one.
            var openTagsSubset = openTagIndexes.Where(x => x >= previousIndex && x < closeTagIndex);
            if (openTagsSubset.Count() > 0 && closeTagIndex - openTagsSubset.Max() > 1)
            {
                // A non-empty tag: keep only the text before its last '<'.
                sb.Append(text.Substring(previousIndex, openTagsSubset.Max() - previousIndex));
            }
            else
            {
                // No tag here (a lone '>' or an empty "<>"): keep the text as is.
                sb.Append(text.Substring(previousIndex, closeTagIndex - previousIndex + 1));
            }
            previousIndex = closeTagIndex + 1;
        }
        // Append whatever follows the last '>'.
        if (closeTagIndexes.Max() < text.Length)
        {
            sb.Append(text.Substring(closeTagIndexes.Max() + 1));
        }
        return sb.ToString();
    }
    else
    {
        return text;
    }
}
Jeff Qi
0
public static string StripHTML(string input)
{
    if (input == null)
    {
        return string.Empty;
    }
    return Regex.Replace(input, "<.*?>", String.Empty);
}
Shunya
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the [help center](https://stackoverflow.com/help/how-to-answer). – Ethan Jul 30 '22 at 02:28
  • 1
    This solution is already provided – EGN Dec 29 '22 at 22:19