How do I remove all HTML tags from a string without knowing which tags are in it?

Question

Is there any easy way to remove all HTML tags or ANYTHING HTML related from a string?

For example:

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)";

The above should really be:

"Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series)"

This question was closed as a duplicate, but the suggested answer uses the Html Agility Pack. If you want to remove HTML tags without the Html Agility Pack, see my answer here: http://stackoverflow.com/a/30026043/2318354 . It may be helpful to someone. – Dilip Langhanoja, May 05 '15 at 10:55
This is not a duplicate, as "HTML agility pack - removing unwanted tags without removing content?" wants to keep some tags (ie, give a list of valid tags, remove the rest). This question here is about removing ALL tags. And I can't use the other question's answers as I'm not going to pass in a list of all html tags in existence. — Thierry_S, Jan 18 '17 at 19:23
Take a look at [xidel](https://sourceforge.net/projects/xidel/). It will take you 95% of the way there with `xidel -s input -e '/'`. — vhs, Apr 24 '20 at 19:19

score 386 · Accepted Answer · edited Jun 22 '21 at 00:29

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

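Be aware that this solution has flaws: it strips any text between a `<` and the next `>`, whether or not that text is a tag, and it leaves HTML entities such as `&nbsp;` untouched. See Remove HTML tags in String for more information (especially the comments by 'Mark E. Haase'/@mehaase).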
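
To get closer to the output asked for in the question, the regex can be combined with entity decoding and whitespace cleanup. The following is a minimal sketch, not part of the original answer: the class name `HtmlText` and the method name `ToPlainText` are hypothetical, and it inherits the regex limitation for text that contains a literal `<` or `>`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper (name is an assumption, not from the answer).
    public static string ToPlainText(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // 1. Drop anything that looks like a tag (same pattern as the answer).
        string noTags = Regex.Replace(input, "<.*?>", string.Empty);

        // 2. Turn entities such as &nbsp; or &amp; into their characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. &nbsp; decodes to U+00A0 (non-breaking space); map it to a plain space.
        decoded = decoded.Replace('\u00A0', ' ');

        // 4. Collapse runs of whitespace into one space and trim the ends.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Called with the `title` string from the question, `HtmlText.ToPlainText(title)` returns `Hulk Hogan's Celebrity Championship Wrestling [Proj # 206010] (Reality Series, )`. The dangling `, ` before the closing parenthesis is ordinary text rather than HTML, so removing it would need a separate, content-specific rule.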

Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?

answered Aug 09 '13 at 19:14 by Bidou · edited Jun 22 '21 at 00:29 by Michael Freidgeim

Doesn't work for input: '7 < 10 but 30 > 10' it gives: '7 but 30 > 10' – Bartosz Pierzchlewicz Oct 03 '17 at 13:39
Yes, because it strips everything between < and >, so in your case, `< 10 ` and `` are both stripped. – Bidou Oct 03 '17 at 16:00
Shouldn't the method name be StripHtml() since method names should use Pascal case? – David Klempfner Apr 28 '18 at 07:39
Using regular expressions for this is probably not a good idea if you are using it for security reasons. – Mathias Lykkegaard Lorenzen Sep 19 '18 at 07:03
Just change the regex to <[a-zA-Z/]*?> – Brandon Prudent Nov 21 '18 at 19:50
@BrandonPrudent maybe better would be `<[a-zA-Z/].*?>` - it includes attributes – tarn Aug 20 '19 at 07:42
I think this Regex considers almost every case <[\/a-zA-Z0-9= \" \\" \' \\' #;:()$_-]*?> – Samy Sammour Feb 23 '22 at 10:11
C# code will be: Regex.Replace(input, "<[\\/a-zA-Z0-9= \"\\\"'\'#;:()$_-]*?>", string.Empty); – Samy Sammour Feb 23 '22 at 10:32
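
The difference between the patterns discussed in these comments can be checked with a small sketch. The class and member names below (`TagPatterns`, `Naive`, `LetterFirst`, `Strip`) are hypothetical; only the two pattern strings come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatterns
{
    // Pattern from the accepted answer: anything between '<' and the next '>'.
    public const string Naive = "<.*?>";

    // Pattern suggested in the comments: '<' must be followed by a letter or '/'.
    public const string LetterFirst = "<[a-zA-Z/].*?>";

    // Removes every match of the given pattern from the input.
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty);
    }
}
```

With the input `7 < 10 but 30 > 10`, `Strip(input, TagPatterns.Naive)` removes `< 10 but 30 >` and returns `7  10` (two spaces), while `Strip(input, TagPatterns.LetterFirst)` returns the input unchanged because the `<` is followed by a space. Both patterns reduce `<b>x</b>` to `x`. The stricter pattern is still a heuristic: text such as `a <b and c> d` is treated as a tag, and because `.` does not match a newline by default, a tag that spans several lines is left in place unless `RegexOptions.Singleline` is used.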

score 84 · Answer 2 · answered Aug 09 '13 at 19:21

You can parse the string using Html Agility pack and get the InnerText.

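    // Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
    HtmlDocument htmlDoc = new HtmlDocument();
    // Regular string literal: \" escapes are not valid inside a verbatim (@"...") string.
    htmlDoc.LoadHtml("<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)");
    string result = htmlDoc.DocumentNode.InnerText;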

answered Aug 09 '13 at 19:21 by ssilas777

I like the `InnerText` solution as it removes all tags. But... it leaves behind `&nbsp;` and also comment tags such as `<!-- -->`, like those surrounding `v:shapetype`, `v:shape` or `v:imagedata` with `[if gte vml 1]` or `[if !vml]` – Thierry_S Jan 18 '17 at 19:26
I realize that `&nbsp;` is an HTML entity, not a tag, so a solution to remove that could be `result = WebUtility.HtmlDecode(result);` and to remove the comment nodes, using the Html Agility Pack: `htmlDoc.DocumentNode.SelectNodes("//comment()")?.ForEach(c=> c.Remove());` just before doing `result = htmlDoc.DocumentNode.InnerText;` – Thierry_S Jan 18 '17 at 20:15
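
The steps from these comments can be put together as follows. This is a sketch under assumptions rather than the answer author's code: it needs the third-party HtmlAgilityPack NuGet package, the class name `HtmlAgilityText` and method name `ToPlainText` are hypothetical, and a plain `foreach` replaces the `ForEach` call quoted above.

```csharp
using System.Linq;
using System.Net;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack"

public static class HtmlAgilityText
{
    // Hypothetical helper (name is an assumption, not from the answer).
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, hence the null check.
        var comments = doc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            // Copy to a list so removing nodes does not disturb the enumeration.
            foreach (var comment in comments.ToList())
            {
                comment.Remove();
            }
        }

        // InnerText drops the tags but leaves entities such as &nbsp; encoded.
        string text = doc.DocumentNode.InnerText;

        // Decode entities; &nbsp; becomes U+00A0, mapped here to a plain space.
        return WebUtility.HtmlDecode(text).Replace('\u00A0', ' ');
    }
}
```

For an input such as `<p>Hello&nbsp;<!-- hidden --><b>world</b></p>`, this should return `Hello world`. Note that `WebUtility.HtmlDecode` on its own does not remove `&nbsp;`; it converts it to a non-breaking space character, which is why the final `Replace` is there.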

score 7 · Answer 3 · edited Aug 09 '13 at 21:02

The code below removes the `&nbsp;` entities first and then strips the tags, leaving the string without its HTML parts.

string title = "<b> Hulk Hogan's Celebrity Championship Wrestling &nbsp;&nbsp;&nbsp;<font color=\"#228b22\">[Proj # 206010]</font></b>&nbsp;&nbsp;&nbsp; (Reality Series, &nbsp;)".Replace("&nbsp;", string.Empty);
string s = Regex.Replace(title, "<.*?>", String.Empty);

answered Aug 09 '13 at 20:50 by Vinay · edited Aug 09 '13 at 21:02

Jeff Qi · Answer 4 · 2022-07-11T20:42:51.013

I built a small function to remove HTML tags.

using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static string RemoveHtmlTags(string text)
{
    // Positions of every '<' and '>' in the input.
    List<int> openTagIndexes = Regex.Matches(text, "<").Cast<Match>().Select(m => m.Index).ToList();
    List<int> closeTagIndexes = Regex.Matches(text, ">").Cast<Match>().Select(m => m.Index).ToList();
    if (closeTagIndexes.Count > 0)
    {
        StringBuilder sb = new StringBuilder();
        int previousIndex = 0;
        foreach (int closeTagIndex in closeTagIndexes)
        {
            // '<' characters between the previous '>' and this one.
            var openTagsSubset = openTagIndexes.Where(x => x >= previousIndex && x < closeTagIndex);
            if (openTagsSubset.Count() > 0 && closeTagIndex - openTagsSubset.Max() > 1)
            {
                // Treat the nearest '<' ... '>' pair as a tag: keep only the text before it.
                sb.Append(text.Substring(previousIndex, openTagsSubset.Max() - previousIndex));
            }
            else
            {
                // No '<' before this '>' (or an empty "<>"): keep the text as it is.
                sb.Append(text.Substring(previousIndex, closeTagIndex - previousIndex + 1));
            }
            previousIndex = closeTagIndex + 1;
        }
        if (closeTagIndexes.Max() < text.Length)
        {
            // Append whatever follows the last '>'.
            sb.Append(text.Substring(closeTagIndexes.Max() + 1));
        }
        return sb.ToString();
    }
    else
    {
        return text;
    }
}

score 0 · Answer 5 · edited Jul 27 '22 at 09:01

public static string StripHTML(string input)
{
    if (input == null)
    {
        return string.Empty;
    }
    return Regex.Replace(input, "<.*?>", String.Empty);
}

answered Jul 27 '22 at 06:25 by Khanbala Rashidov · edited Jul 27 '22 at 09:01 by Shunya

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the [help center](https://stackoverflow.com/help/how-to-answer). – Ethan Jul 30 '22 at 02:28
This solution is already provided – EGN Dec 29 '22 at 22:19
