Using C# regular expressions to remove HTML tags

Question

How do I use a C# regular expression to replace or remove all HTML tags, including the angle brackets? Can someone please help me with the code?

You don't indicate it, but I'm inferring that you also want to remove script and style elements entirely and not just remove the tag. The HTML Agility Pack answer below is correct for removing the tags, but to remove script and style, you'll also need something like http://stackoverflow.com/questions/13441470/htmlagilitypack-remove-script-and-style — John, Nov 14 '13 at 17:21
The question indicated as a duplicate has a lot of information (and Tony the Pony!), but it only asked for opening tags, not all tags. So I'm not sure it's technically a duplicate. That said, the answer is the same: don't. — goodeye, May 17 '14 at 00:36

score 177 · Answer 1 · edited Sep 26 '12 at 00:22

As often stated before, you should not use regular expressions to process XML or HTML documents. They do not perform very well with HTML and XML documents, because there is no way to express nested structures in a general way.

You could use the following.

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This will work for most cases, but there will be cases (for example CDATA containing angle brackets) where this will not work as expected.
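
To make the trade-off concrete, here is a minimal sketch that wraps the one-line pattern in a helper. The class and method names (`NaiveTagStripper`, `Strip`) are hypothetical and exist only for this illustration, and the attribute-value input mentioned afterwards is an illustrative failure case of the same kind as the CDATA one, not something taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveTagStripper
{
    // Hypothetical helper; the pattern is the one quoted in the answer.
    // It removes everything from a '<' up to the next '>'.
    public static string Strip(string html)
    {
        return Regex.Replace(html ?? string.Empty, @"<[^>]*>", string.Empty);
    }
}
```

For `<p>Hello <b>world</b></p>` this returns `Hello world`. For `<a title="a > b">link</a>` the match stops at the `>` inside the quoted attribute, so the result is ` b">link` rather than `link`.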

edited Sep 26 '12 at 00:22 by verdesmarald
answered Apr 25 '09 at 00:31 by Daniel Brückner

score 15 · This is a naive implementation.. That is,
is unfortunately, valid html. Handles most sane cases though.. – Ryan Emerle Apr 25 '09 at 00:38
score 8 · As stated, I am aware that this expression will fail in some cases. I am not even sure if the general case can be handled by any regular expression without errors. – Daniel Brückner Apr 25 '09 at 00:49
score 1 · No, this will fail in all cases! It's greedy. – Jake Apr 25 '09 at 01:04
score 13 · @Cipher, why do you think greediness is a problem? Assuming the match starts at the beginning of a valid HTML tag, it will never extend beyond the end of that tag. That's what the [^>] is for. – Alan Moore Apr 25 '09 at 01:37
score 1 · @AlanMoore HTML is not a "regular language", i.e. you can't properly match everything that is valid HTML with regexes. See: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Kache Mar 14 '12 at 22:36
@Jake, no, it's delimited, so greedy is better. If you made that ungreedy, it'd be way slower. – ChrisF May 09 '13 at 23:29
score 1 · This is a helpful answer because I just need to rip out a.hrefs from a line of text for an email subject line. – Jason May 20 '15 at 15:12
score 1 · For my solution I don't need to be parsing out or processing an HTML document. All I need is to strip out HTML that will be made incomplete after truncating to X number of characters (and therefore screwing up the look of the web page). – ahwm Aug 13 '15 at 18:20
Why not `<[^>].+?>`? – GRUNGER Sep 04 '15 at 16:59
It removes the hyperlink set on text as well. How can I stop removing hyperlink set on simple text? like test link – giparekh Oct 04 '17 at 13:29
@DanielBrückner Your answer removes entire elements if they are 1 letter tags such as . Instead you should use `<[^>].*?>`. – John Jan 05 '21 at 00:28

score 84 · Answer 2 · edited Jun 01 '21 at 10:45

The correct answer is don't do that, use the HTML Agility Pack.

Edited to add:

To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even the most imperfectly formed, capricious bits of HTML:

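// Requires: using System.Linq; using System.Text; using System.Web; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
// SelectNodes returns null when nothing matches, so guard before enumerating.
var textNodes = doc.DocumentNode.SelectNodes("//body//text()");
StringBuilder output = new StringBuilder();
if (textNodes != null)
{
    foreach (string line in textNodes.Select(node => node.InnerText))
    {
        output.AppendLine(line);
    }
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());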

There are very few defensible cases for using a regular expression for parsing HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part way there with a RegEx, but you'll need to do manual verifications.

Html Agility Pack can provide you a robust solution that will reduce the need to manually fix up the aberrations that can result from naively treating HTML as a context-free grammar.

A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.

edited Jun 01 '21 at 10:45
answered Apr 25 '09 at 00:51 by JasonTrue

score 28 · HTML Agility Pack is not the answer to everything related to working with HTML (e.g. what if you only want to work with fragments of the HTML code?!). – PropellerHead Oct 23 '09 at 07:23
score 7 · It works pretty well with fragments of HTML, and it's the best option for the scenario described by the original poster. A Regex, on the other hand, only works with an idealized HTML and will break with perfectly valid HTML, because the grammar of HTML is not regular. If he were using Ruby, I still would have suggested nokogiri or hpricot, or beautifulsoup for Python. It's best to treat HTML like HTML, not some arbitrary text stream with no grammar. – JasonTrue Oct 23 '09 at 15:54
score 1 · HTML is not a regular grammar, and therefore cannot be parsed solely with regular expressions. You can use regexes for lexing, but not for parsing. It's really that simple. Linguists would have agreed on this before HTML even existed. – JasonTrue Mar 15 '11 at 15:43
score 20 · This isn't a matter of opinion. A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery. – JasonTrue Mar 15 '11 at 15:52
Regex can work with basic and simple parsing. However, you still can extract all of the links from any HTML as the links or URLs have a fixed pattern... – Desolator Dec 12 '11 at 08:40
score 1 · I don't understand what you guys are arguing over? The OP wants to replace/remove HTML tags, and mentioned nothing of parsing it. HTML Agility Pack is overkill. – Dylan Vester Dec 15 '11 at 19:55
score 2 · You can't correctly identify HTML tags reliably without parsing HTML. Do you understand all of the grammar for HTML? See the evil hack to get "pretty close" that other answers suggest, and tell me why you'd want to have to maintain that. Downvoting me because a hacky quick attempt works for your sample input isn't going to make your solution correct. I've occasionally used regexes to generate reports from HTML content or to fix up some CSS reference using negative matching on > to limit the chance of errors, but then we did additional verifications; it wasn't general purpose. – JasonTrue Dec 16 '11 at 09:13
score 1 · The following code would do: `HtmlDocument doc = new HtmlDocument(); doc.LoadHtml(Properties.Resources.HtmlContents); var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText); StringBuilder output = new StringBuilder(); foreach (string line in text) { output.AppendLine(line); } string textOnly = HttpUtility.HtmlDecode(output.ToString());` – jessehouwing Apr 03 '12 at 11:58
The link for the HTML-AGILITY-PACK will be broken soon (1 July 2021), so this is the new one: https://html-agility-pack.net/?z=codeplex – Ben.S May 31 '21 at 23:59

score 39 · Answer 3 · answered Apr 25 '09 at 02:59

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:

remove the <!DOCTYPE declaration or <?xml prolog if they exist
remove all SGML comments
remove the entire HEAD element
remove all SCRIPT and STYLE elements
do Grabthar-knows-what with FORM and TABLE elements
remove the remaining tags
remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone

That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.

But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.

In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)

This regex isn't perfect, of course, but it's probably as good as you'll ever need.
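
As a rough illustration of what the quoted-string alternatives buy you, the sketch below wraps the regex above in a helper. The names `QuoteAwareTagStripper` and `Strip` are hypothetical, and the sample inputs are assumptions chosen for the demonstration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareTagStripper
{
    // Same pattern as in the answer: an atomic group for "<name" or "</name",
    // then an atomic group that consumes unquoted text and quoted strings,
    // then the closing '>'.
    private static readonly Regex TagPattern = new Regex(
        @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    // Hypothetical helper that removes every match of the pattern.
    public static string Strip(string fragment)
    {
        return TagPattern.Replace(fragment ?? string.Empty, string.Empty);
    }
}
```

With this pattern, `<a title="a > b">link</a>` becomes `link`, because the `>` inside the double-quoted attribute value is consumed by the quoted-string alternative. Because the pattern requires a word character after `<` or `</`, an input such as `a <!-- c --> b` is returned unchanged, which is why comments, DOCTYPE declarations and CDATA markers need the separate steps listed earlier.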

This is by far the best answer. You answer the poster's question and explain why a regular expression should not be used for the given task. Well done. — JWilliams, Jan 27 '12 at 17:52

score 28 · Answer 4 · answered Apr 25 '09 at 00:31

Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

Source

answered Apr 25 '09 at 00:31 by Ryan Emerle

score 20 · Answer 5 · answered May 17 '12 at 21:55

@JasonTrue is correct that stripping HTML tags should not be done via regular expressions.

It's quite simple to strip HTML tags using HtmlAgilityPack:

public string StripTags(string input) {
    var doc = new HtmlDocument();
    doc.LoadHtml(input ?? "");
    return doc.DocumentNode.InnerText;
}

answered May 17 '12 at 21:55 by zzzzBov

score 1 · Whilst I'm a bit late on this, I'd like to mention that this also works on XML such as that produced by Word and other Office products. Anyone who's ever had the need to deal with Word XML would do well to look at using this because it does help a lot, especially if you need to strip tags from content, which is exactly what I needed it for. – Steve Pettifer Apr 09 '13 at 08:18
When all else seemed to fail, this simple code snippet saved the day. Thanks! – Ted Krapf Mar 06 '20 at 03:49
Anyone got the exception "Illegal characters in path." when the debug runs to the line doc.LoadHtml? – anhtv13 Jan 25 '21 at 07:32
I am wondering why do we need to specify the "??" and "" characters in doc.LoadHtml()? I tried without these characters and the method did not work for me. – Ruslan Feb 22 '22 at 15:14

score 14 · Answer 6 · answered Jan 13 '12 at 12:22

I would like to echo Jason's response though sometimes you need to naively parse some Html and pull out the text content.

I needed to do this with some Html which had been created by a rich text editor, always fun and games.

In this case you may need to remove the content of some tags as well as just the tags themselves.

In my case <xml> and <style> tags were thrown into this mix. Someone may find my (very slightly) less naive implementation a useful starting point.

    /// <summary>
    /// Removes all HTML tags from a string and leaves only plain text.
    /// Removes the content of <xml></xml> and <style></style> elements, as the aim is to get text content, not markup or metadata.
    /// </summary>
    /// <param name="input">The HTML string to strip.</param>
    /// <returns>The input with tags removed.</returns>
    public static string HtmlStrip(this string input)
    {
        input = Regex.Replace(input, "<style>(.|\n)*?</style>", string.Empty); // remove <style> elements and anything in between
        input = Regex.Replace(input, @"<xml>(.|\n)*?</xml>", string.Empty); // remove <xml> elements and anything in between
        return Regex.Replace(input, @"<(.|\n)*?>", string.Empty); // remove any remaining tags but not their content: "<p>bob<span> johnson</span></p>" becomes "bob johnson"
    }
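
A possible variation on the method above, offered only as a sketch: the two element removals can be combined with a captured alternation and a backreference, and the final step can use a negated character class instead of a lazy quantifier. The class name `HtmlStripSketch` is hypothetical, and allowing attributes on the opening tag and matching case-insensitively are assumptions that go beyond the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlStripSketch
{
    // Matches a whole <style>...</style> or <xml>...</xml> element.
    // \1 refers back to whichever tag name the first group captured;
    // Singleline lets '.' cross line breaks.
    private static readonly Regex ElementsWithContent = new Regex(
        @"<(style|xml)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any remaining tag, from '<' to the next '>'.
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>");

    public static string HtmlStrip(string input)
    {
        if (input == null) return string.Empty;
        string withoutBlocks = ElementsWithContent.Replace(input, string.Empty);
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

Given `<style type="text/css">p { color: red; }</style><p>bob<span> johnson</span></p>`, this returns `bob johnson`. It shares the limitations discussed elsewhere in the thread, such as angle brackets inside attribute values.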

Apart from obvious crossplatform linebreak issues, having an ungreedy quantifier is slow when the content is delimited. Use things like `.*(?!)` with the `RegexOptions.SingleLine` modifier for the first two and `<[^>]*>` for the last. The first ones can also be combined by a captured alternation in the first tag name and backreferences to it in the negative lookahead and final tag. — ChrisF, May 09 '13 at 23:38

score 6 · Answer 7 · answered Jul 11 '12 at 13:28

Try the regular expression method at this URL: http://www.dotnetperls.com/remove-html-tags

/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
    return Regex.Replace(source, "<.*?>", string.Empty);
}

/// <summary>
/// Compiled regular expression for performance.
/// </summary>
static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

/// <summary>
/// Remove HTML from string with compiled Regex.
/// </summary>
public static string StripTagsRegexCompiled(string source)
{
    return _htmlRegex.Replace(source, string.Empty);
}
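
One detail worth checking with the `<.*?>` pattern: by default `.` does not match a newline in .NET, so a tag whose attributes wrap onto a second line is not matched. The sketch below compares three variants; the class and method names are hypothetical and the input described afterwards is an assumed example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultiLineTagDemo
{
    // The pattern from the answer: '.' stops at line breaks.
    public static string StripLazyDot(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same pattern, but Singleline makes '.' match line breaks too.
    public static string StripLazyDotSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    // A negated character class matches line breaks without any option.
    public static string StripNegatedClass(string source)
    {
        return Regex.Replace(source, "<[^>]*>", string.Empty);
    }
}
```

For the input `"<a\nhref=\"x\">link</a>"`, `StripLazyDot` removes only the closing tag and leaves the opening tag in place, while the other two methods both return `link`.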

score 4 · Answer 8 · edited Apr 02 '12 at 22:23

Use this:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

edited Apr 02 '12 at 22:23 by Michael Fredrickson
answered Dec 13 '10 at 10:30 by Swaroop

score 2 · Answer 9 · answered Sep 04 '15 at 16:56

Add .+? to <[^>]*> and try this regex (based on this):

<[^>].+?>

c# .net regex demo
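
Before adopting this variant, it may help to compare it with the plain `<[^>]*>` pattern on short tags. The sketch below is only an illustration; the class and method names are hypothetical and the sample inputs are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVariantDemo
{
    // The pattern suggested in this answer: one non-'>' character,
    // then at least one more character (lazily), then '>'.
    public static string StripWithLazyVariant(string html)
    {
        return Regex.Replace(html, @"<[^>].+?>", string.Empty);
    }

    // The plain negated-class pattern, for comparison.
    public static string StripWithNegatedClass(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }
}
```

For `<b>x</b> y`, the lazy variant matches from `<b` through the `>` of the closing tag, so it returns ` y` and loses the `x`, while the negated-class pattern returns `x y`. A trailing one-letter tag such as the `<p>` in `text<p>` is not matched by the lazy variant at all, because `.+?` consumes the `>` and no further `>` follows.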

answered Sep 04 '15 at 16:56 by GRUNGER

score -2 · Answer 10 · edited Nov 03 '16 at 06:20

Use this method to remove tags:

// Requires: using System.Text.RegularExpressions;
public string From_To(string text, string from, string to)
{
    if (text == null)
        return null;
    // Lazily match everything from the opening delimiter to the nearest closing delimiter.
    // Note: 'from' and 'to' are inserted as regex patterns, not as escaped literals.
    string pattern = from + ".*?" + to;
    return Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnoreCase);
}
