87

How can I remove all HTML tags, including &nbsp;, using regex in C#? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"
rampuriyaaa
  • 9
    Don't use a regex, check out the HTML Agility Pack. http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack – Tim Oct 22 '13 at 16:58
  • 1
    Thanks Tim, but the application is quite big and intact, adding or downloading a html agility pack won't work. – rampuriyaaa Oct 22 '13 at 17:00

10 Answers

211

If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex that collapses runs of multiple whitespace characters:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
Ravi K Thapliyal
  • I haven't yet tested this as much as I will need to, but it worked better than I expected it to work. I'll post the method I wrote below. – Don Rolling Jul 31 '14 at 14:47
  • 1
    A lazy match (`<[^>]+?>` as per @David S.) might make this a tad faster, but just used this solution in a live project - very happy +1 :) – iCollect.it Ltd Jan 13 '15 at 15:34
  • Regex.Replace(inputHTML, @"<[^>]+>|&nbsp|\n;", "").Trim(); \n is not getting removed – Mahesh Malpani Apr 08 '15 at 09:31
  • @MaheshMalpani I tried and it [works](http://imgur.com/cuoQfte) with newlines too. Try using `\r` or `\r\n` instead, because your input may be coming from a non-Unix platform. – Ravi K Thapliyal Apr 08 '15 at 15:39
  • Works perfectly, Thanks! not sure about the \r\n though. – Tauseef Sep 04 '15 at 20:25
  • 3
    I would recommend adding a space rather than an empty string, since we are collapsing extra spaces anyway: `Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", " ")` – Tauseef Sep 04 '15 at 20:47
  • @Tauseef: I agree. `&nbsp;` means space, so just removing it would make no sense. – awe Sep 23 '15 at 07:41
  • 2
    @Tauseef If you use a space in the first replace call, you may end up leaving spaces where there were none in the original input. Say you receive `Sound<b>Cloud</b>` as an input; you'll end up with `Sound Cloud`, while it should've been stripped to `SoundCloud` because that's how it gets displayed in HTML. – Ravi K Thapliyal Sep 26 '15 at 17:02
  • @Tauseef Along the same lines, you'll end up stripping `Sound&nbsp;Cloud` to `SoundCloud`, whereas you need to output `Sound Cloud`, as that's how it gets rendered in HTML. – Ravi K Thapliyal Sep 26 '15 at 17:08
  • @awe You have a point, but my answer was primarily aimed at the OP, who needed the `&nbsp;`s removed. It's perfectly fine to tweak the answer to suit your needs, but I suggest you make a separate replace call to handle the `&nbsp;`s the way you want, for the reasons outlined in my comments above. – Ravi K Thapliyal Sep 26 '15 at 17:16
  • [You can't parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Tim Schmelter Sep 26 '16 at 12:11
  • The second line could be done with a "safer" regex that handles newlines better - http://stackoverflow.com/a/5610133/852806 -> Regex.Replace(inputString, @"[^\S\r\n]{2,}", " "); – JsAndDotNet Nov 01 '16 at 12:11
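
The trade-off discussed in these comments (empty string versus space as the replacement, and handling `&nbsp;` in its own call) can be sketched roughly as follows. The helper names and the sample inputs are illustrative assumptions, not part of the original answer, and the snippet uses top-level statements, which need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Variant A (hypothetical name): drop tags entirely, turn &nbsp; into a
// real space in a separate call, then collapse runs of whitespace.
static string StripKeepingNbspAsSpace(string html)
{
    string noTags = Regex.Replace(html, @"<[^>]+>", "");
    string spaced = noTags.Replace("&nbsp;", " ");
    return Regex.Replace(spaced, @"\s{2,}", " ").Trim();
}

// Variant B (hypothetical name): replace both tags and &nbsp; with a space
// in one pass, then collapse runs of whitespace.
static string StripReplacingAllWithSpace(string html)
{
    string spaced = Regex.Replace(html, @"<[^>]+>|&nbsp;", " ");
    return Regex.Replace(spaced, @"\s{2,}", " ").Trim();
}

Console.WriteLine(StripKeepingNbspAsSpace("Sound<b>Cloud</b>"));    // SoundCloud
Console.WriteLine(StripKeepingNbspAsSpace("Sound&nbsp;Cloud"));     // Sound Cloud
Console.WriteLine(StripReplacingAllWithSpace("Sound<b>Cloud</b>")); // Sound Cloud
```

Variant A keeps inline markup from splitting a word, but it also glues together text from adjacent block elements such as `<div>a</div><div>b</div>`, so which variant fits depends on the input.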
33

I took @Ravi Thapliyal's code and made a method. It is simple and might not clean everything, but so far it is doing what I need it to do.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
Don Rolling
17

I've been using this function for a while. Removes pretty much any messy html you can throw at it and leaves the text intact.

        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        //add characters that should not be removed to this regex
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }
David S.
  • Just to confirm: the SingleSpacedTrim() function does the same thing as string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); from Ravi Thapliyal's answer? – Jimmy Apr 28 '14 at 17:22
  • @Jimmy as far as I can see, that regex doesn't catch single tabs or newlines like SingleSpacedTrim() does. That could be a desirable effect though, in that case just remove the cases as needed. – David S. Apr 29 '14 at 19:25
  • Nice, but it seems to replace single and double quotes with blank spaces as well, although they are not in the "_notOkCharacter_" list, or am I missing something there? Is this part of the Decoding/Encoding methods called at the beginning? What would be necessary to keep these characters intact? – ArgisIsland Dec 21 '16 at 10:22
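
The difference raised in the first two comments can be checked with a small sketch. The sample string is an illustrative assumption. `\s+` behaves much like `SingleSpacedTrim()`, though `\s` also matches whitespace characters that the `switch` above does not list.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative input: one single tab, one double newline, one double space.
string input = "a\tb\n\nc  d";

// Only runs of two or more whitespace characters are replaced,
// so the single tab between "a" and "b" is left in place.
string twoOrMore = Regex.Replace(input, @"\s{2,}", " ");

// Every run of whitespace, including a single tab or newline,
// becomes exactly one space.
string oneOrMore = Regex.Replace(input, @"\s+", " ").Trim();

Console.WriteLine(twoOrMore == "a\tb c d"); // True
Console.WriteLine(oneOrMore == "a b c d");  // True
```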
4
var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
MRP
2

I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their version replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "").Trim();
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    return step3;
}

I wrote &nbsp without the semicolon in the text above because Stack Overflow would otherwise render it as a space.

Sabique A Khan
1

Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer

Ehsan88
  • I think it's more against XSS attacks than to normalize html – Revious Feb 24 '19 at 14:50
  • 1
    @Revious I think you are right. Maybe my answer is not related much to the OP's question as they did not mention the purpose of removing html tags. But if the purpose is to prevent attacks, as it is in many cases, then using an already developed sanitizer may be a better approach. BTW I have no knowledge about what the meaning of **normalizing html** is. – Ehsan88 Feb 25 '19 at 20:20
0

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

then x = hello
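
One difference from the `<[^>]+>` pattern used in other answers is worth knowing: `.` does not match a newline by default, so a tag that spans lines is skipped. A minimal sketch, using an illustrative input that is not from the question:

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative input: the opening tag is split across two lines.
string html = "<div\nclass=\"x\">hello</div>";

// '.' does not match '\n' by default, so the split opening tag survives.
string withDot = Regex.Replace(html, @"(<.+?>|&nbsp;)", "").Trim();

// [^>] matches any character except '>', newlines included.
string withNegatedClass = Regex.Replace(html, @"<[^>]+>|&nbsp;", "").Trim();

Console.WriteLine(withDot == "<div\nclass=\"x\">hello"); // True
Console.WriteLine(withNegatedClass == "hello");           // True
```

Passing `RegexOptions.Singleline` makes `.` match newlines as well, which removes the difference for this input.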

Jonesopolis
0

Well-formed HTML (XHTML) is essentially XML. You could parse your text into an XmlDocument object and read InnerText on the root element. This strips all HTML tags in any form and also decodes entities like &lt; in one go. It only works on well-formed markup, though: an unclosed tag such as <br>, or an entity that XML does not predefine such as &nbsp;, makes the parse throw.
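
A minimal sketch of this approach, assuming the input is well-formed XML. The helper name `XmlInnerText` and the wrapping `<root>` element are illustrative assumptions. Markup that is not well-formed, such as an unclosed `<br>`, makes `LoadXml` throw an `XmlException`.

```csharp
using System;
using System.Xml;

// Hypothetical helper: parse the markup as XML and return its text content.
static string XmlInnerText(string markup)
{
    var doc = new XmlDocument();
    // Wrap the fragment in one root element so that input with several
    // top-level elements still parses as a single document.
    doc.LoadXml("<root>" + markup + "</root>");
    return doc.DocumentElement.InnerText;
}

// Tags are dropped and &lt; is decoded; text nodes are concatenated as-is.
string text = XmlInnerText("<div>hello</div><div>a &lt; b</div><div><br/></div>");
Console.WriteLine(text == "helloa < b"); // True

// An unclosed <br> is not well-formed XML, so parsing fails.
bool threw = false;
try { XmlInnerText("<div><br></div>"); }
catch (XmlException) { threw = true; }
Console.WriteLine(threw); // True
```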

nivs1978
0

I'm using this syntax to remove HTML tags along with &nbsp;

SessionTitle:result[i].sessionTitle.replace(/<[^>]+>|&nbsp;/g, '')
mymiracl
-1
(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

FelixSFD