134

I have snippets of HTML stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to display that HTML as text only, with no formatting, on a given page (actually just the first 30-50 characters, but that's the easy bit).

How do I place the "text" within that HTML into a string as plain text?

So this piece of code:

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i></p>

Becomes:

Hello World. Is there anyone out there?

Michel Ayres
Stuart Helwig
  • There are some good suggestions from the W3C here: http://www.w3.org/Tools/html2things.html – Rich Jun 21 '12 at 15:09
  • There's some pretty simple and straight-forward code to convert HTML to plain text at http://www.blackbeltcoder.com/Articles/strings/convert-html-to-text. – Jonathan Wood Apr 11 '11 at 15:30
  • You may want to use SgmlReader. http://code.msdn.microsoft.com/SgmlReader – Leonardo Herrera Nov 13 '08 at 12:40
  • 4
    How can a question be marked as a duplicate of a question that was asked 6 months later? Seems a little backward... – Stuart Helwig Jun 26 '13 at 02:52
  • I've written [a function that does convert HTML to plain text](http://pastebin.com/NswerNkQ). It has some limitations like e.g. not extracting links from `a` tags. I should better base my function on the [source code of PHP's html2text](https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php). – Uwe Keim Sep 23 '18 at 14:14

20 Answers

127

The MIT-licensed HtmlAgilityPack includes, in one of its samples, a method that converts HTML to plain text.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello, world!
Prof. Falken
Judah Gabriel Himango
  • 14
    I have used HtmlAgilityPack before but I can't see any reference to ConvertToPlainText. Are you able to tell me where i can find it? – horatio Jan 08 '10 at 03:43
  • 9
    Horatio, it is included in one of the samples that comes with HtmlAgilityPack: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179 – Judah Gabriel Himango Jan 08 '10 at 15:37
  • 1
    Before using this method, I advise to check if it really does whitelisting internally. Otherwise, it is dangerous. – usr Aug 14 '11 at 19:39
  • 8
    Actually, there isn't a built in method for this in the Agility Pack. What you linked to is an example which uses the Agility Pack to traverse the node tree, remove `script` and `style` tags and write inner text of other elements into the output string. I doubt it's passed much testing with real world inputs. – Lou Sep 02 '12 at 12:19
  • Yep, it's in the samples. It uses core HTML agility pack functionality to parse the document and spit out the text of the nodes, while skipping over styles, scripts, and comments. See the code for yourself: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179 – Judah Gabriel Himango Dec 05 '12 at 22:42
  • You'll notice the code blacklists script blocks, comments, and styles. – Judah Gabriel Himango Feb 20 '13 at 20:15
  • 4
    Can somebody please provide code that work, as opposed to links to samples that need to be retrofitted to work properly? – Eric K Sep 17 '13 at 19:24
  • Did you look at the link I posted in the comments? http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772#52179 – Judah Gabriel Himango Sep 17 '13 at 19:50
  • 2
    The linked sample works nicely. For anyone struggling to use it, just copy the whole class into your own project and use the ConvertHTML method. You would also need to download and reference the HtmlAgilityPack dll into your project. – rdans Oct 16 '13 at 14:22
  • Everything ok but when i try to convert this text i get an xss alert :

    <script>alert("Ups")</script>

    , because HtmlEntity.DeEntitize() method convert &lt:script to
    – Zabaa Feb 26 '14 at 14:16
  • 3
    The supplied link doesn't parse whitespace very well. an alternate is in answer to the SO question at http://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c#25178738 – Brent Aug 10 '14 at 03:53
  • 10
    The sample can now be found here: https://github.com/ceee/ReadSharp/blob/master/ReadSharp/HtmlUtilities.cs – StuartQ Jul 13 '18 at 11:59
  • Good enough to convert the HTML fragment that a [RichTextEditor](https://richtexteditor.com) provides. – Mike Finch Jul 16 '21 at 22:44
79

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

// Requires: using System; using System.Net; using System.Text.RegularExpressions;
private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @">\s+<"; // one or more whitespace characters or line breaks between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)"; // any characters between '<' and '>', even when the closing '>' is missing
    const string lineBreak = @"<br\s?/?>"; // matches <br>, <br/>, <br /> (any letter case, see IgnoreCase below)
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.IgnoreCase);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace);

    var text = html;
    //Remove whitespace/line breaks between tags
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);
    //Decode HTML entities last, so that encoded text such as &lt;b&gt; is not mistaken for a tag
    text = WebUtility.HtmlDecode(text);

    return text;
}
Ben Anderson
  • 5
    <blabla> was parsed out so I moved the text = System.Net.WebUtility.HtmlDecode(text); to the bottom of the method – Luuk Aug 20 '14 at 10:34
  • 1
    This was great, I also added a multispace condenser as the html might have been generated from a CMS: var spaceRegex = new Regex("[ ]{2,}", RegexOptions.None); – Enkode Apr 03 '16 at 08:02
  • Sometime, in the html code there is coder's new line (new line can't be seen in comment, so I show it with [new line], like:
    I [new line] miss [new line] you
    , So it suppose to show: "I miss you", but it show I [new line] miss [new line] you. This make the plain text look painful. Do you know how to fix?
    – 123iamking Jun 07 '16 at 07:24
  • @123iamking you can use this before return text; : text.Replace("[new line]", "\n"); – Eslam Badawy Oct 20 '18 at 16:51
  • I was using this and realized that sometimes it leaves '>' at the beginning of the strings. The other solution of applying regex <[^>]*> works fine. – Etienne Charland Mar 04 '19 at 16:51
  • Regex is in System.Text.RegularExpressions – Eric Barr Jun 28 '19 at 23:11
  • 1
    Am I the only one here who thinks that Regex are not to be used for parsing structured languages like HTML? https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Mladen B. Jun 04 '21 at 07:23
32

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.

If you can use regular expressions there are many web pages out there with good info:

If you need the more complex behaviour of a CFG I would suggest using a third party tool, unfortunately I don't know of a good one to recommend.
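
For a fragment like the one in the question, the quick tag-stripping approach might look like the sketch below. It is only a sketch: the names `QuickStrip` and `ToPreview` are made up for illustration, and it assumes trusted input with no <script> or <style> blocks. Each tag is replaced with a space so that text from neighbouring elements does not run together; the trade-off is that a tag in the middle of a word would split that word.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class QuickStrip
{
    // Hypothetical helper, not part of the answer above.
    public static string ToPreview(string html, int maxLength)
    {
        // Replace each tag with a space so adjacent elements do not run together.
        string text = Regex.Replace(html, "<[^>]*>", " ");
        // Decode entities such as &amp; only after the tags are gone.
        text = WebUtility.HtmlDecode(text);
        // Collapse runs of whitespace and trim the ends.
        text = Regex.Replace(text, @"\s+", " ").Trim();
        // Keep at most maxLength characters.
        return text.Length <= maxLength ? text : text.Substring(0, maxLength);
    }
}
```

The `maxLength` parameter covers the "first 30 - 50 characters" part of the question.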

vfilby
  • 3
    You also have to worry about > in attribute values, comments, PIs/CDATA in XML and various common malformednesses in legacy HTML. In general [X][HT]ML is not amenable to parsing with regexps. – bobince Nov 13 '08 at 12:58
  • You can accomodate the > in attribute values but making attributes a part of the regular expression. It is only the complexity of nested tags that limits the usefulness of parsing with regular expressions. – vfilby Nov 16 '08 at 15:33
  • don't you mean <[^>]*> which matches things like , and not <[^>]>* which matches things like >>> ? – Greg Jun 30 '09 at 13:16
  • 20
    This is a terrible method to do it. The correct way is to parse the HTML with a lib and to traverse the dom outputing only whitelisted content. – usr May 26 '11 at 18:09
  • 3
    @usr: The part you are referring to is the CFG part of the answer. Regex can be used for quick and dirty tag stripping, it has it's weaknesses but it is quick and it is easy. For more complicated parsing use a CFG based tool (in your parlance a lib that generates a DOM). I haven't performed the tests but I'd wager that DOM parsing is slower than regex stripping, in case performance needs to be considered. – vfilby May 29 '11 at 22:59
  • 1
    @vfilby: NO! Tag stripping is blacklisting. Just as an example what you forgot: Your regex will not strip tags which are missing the closing '>'. Did you think of that? I am not sure if this can be a problem but this proves at least that you missed this case. Who knows what else you missed. Here another one: you miss images with a javascript src attribute. NEVER do blacklisting except if security is not important. – usr May 31 '11 at 12:39
  • @usr: The part you are referring to is the CFG part of the answer. Regex can be used for quick and dirty tag stripping, it has it's weaknesses but it is quick and it is easy. For more complicated parsing use a CFG based tool (in your parlance a lib that generates a DOM). – vfilby Aug 09 '11 at 21:26
  • 1
    @vfilby, the first attack that comes to mind is writing "
    – usr Aug 14 '11 at 19:38
  • @usr Improper input can just as easily confuse a parsing library (what I refer to as a CFG) so I am not sure that is the best evidence against tag stripping. Nested tags are not supported by regex tag stripping as noted in my answer (i.e. – vfilby Jun 21 '12 at 03:43
  • 1
    @vfilby, it doesn't matter if the parsing lib is confused or not. All you need to do is take the DOM from it (any DOM at all) and output only whitelisted components. This is always safe, itdoes not matter what the parsed DOM looks like. Also, I told you multiple examples where your "simple" method will fail to remove tags. – usr Jun 21 '12 at 13:34
  • 1
    As @bobince wrote, HTML is not amenable to parsing with regular expressions. This will blow up on real world HTML, which is often malformed. – Judah Gabriel Himango Dec 05 '12 at 22:38
  • 1
    not the proper way, it only converts tags, an html text may contain line breaks, tabs and other formatting which this regex doesn't remove. – Amir Dora. Sep 23 '20 at 18:27
21

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
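
Note that HTML encoding escapes markup rather than removing it, so the tags stay visible as literal text. A minimal sketch of that behaviour, assuming `System.Net.WebUtility` is used in place of `Server.HtmlEncode` so that no web context is needed; the `EncodeDemo` class name is only for illustration:

```csharp
using System;
using System.Net;

public static class EncodeDemo
{
    // Escapes markup so that it displays as literal text; nothing is stripped.
    public static string Encode(string html) => WebUtility.HtmlEncode(html);

    // Reverses the encoding.
    public static string Decode(string encoded) => WebUtility.HtmlDecode(encoded);
}
```

With this sketch, `EncodeDemo.Encode("This is a <Test String>.")` gives `This is a &lt;Test String&gt;.`, and `Decode` turns that back into the original string.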
George Stocker
11

Three-step process for converting HTML into plain text

First, install the NuGet package for HtmlAgilityPack. Second, create this class:

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

The class above is based on the sample referenced in Judah Himango's answer.

Third, create an instance of the class and call ConvertHtml(HTMLContent) to convert HTML into plain text, rather than ConvertToPlainText(string html):

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);
Abdulqadir_WDDN
8

To add to vfilby's answer, you can just perform a Regex replace within your code; no new classes are necessary. I'm adding this in case other newbies like myself stumble upon this question.

using System.Text.RegularExpressions;

Then...

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}
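
A known weakness of regex stripping shows up with a tag that is missing its closing bracket. The sketch below (class and method names are hypothetical) compares the `<[^>]*>` pattern with the `<[^>]*(>|$)` variant from an earlier answer, which also consumes an unterminated tag at the end of the input. Neither is a sanitizer; if the result is written into a page, it should still be HTML-encoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripComparison
{
    // Pattern that requires a closing '>' on every tag.
    public static string StripClosedTags(string source) =>
        Regex.Replace(source, "<[^>]*>", string.Empty);

    // Variant that also removes a tag left open at the end of the input.
    public static string StripIncludingUnterminated(string source) =>
        Regex.Replace(source, "<[^>]*(>|$)", string.Empty);
}
```

For the input `<b>hi</b><script src=x`, the first method returns `hi<script src=x` and the second returns `hi`.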
WEFX
  • 22
    NOT GOOD! This can be tricked to contain script by omiting the closing angle bracket. GUYS, never do blacklisting. You _cannot_ sanitize input by blacklisting. This is so wrong. – usr May 26 '11 at 18:11
6

This has the limitation of not collapsing long runs of inline whitespace, but it is portable and respects block layout much as a web browser does.

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}
jeiea
  • @Prof.Falken I admit. I think every code have pros and cons. Its cons is solidity, and pros may be simplicity (in respect of sloc). You may post a code using `XDocument`. – jeiea Feb 20 '21 at 19:26
  • This is a most reliable solution because is using HTML tags and not anything that looks like it. During mailing HTML testing, this was the absolute perfect solution. I changed "\n" for Environment.NewLine. Finally added return buf.Trim(); to the final result for my needs. Great one, this should be the best answer. – Tanner Ornelas Apr 03 '22 at 00:06
5

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they display correctly on a Crystal report, and it works perfectly!

mikhail-t
3

The simplest way I found:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

The dll can be found in folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the dll also requires reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

Roman O
3

Updating the answer for 2023. The answer is basically the same as always:

  1. Install the latest HtmlAgilityPack

  2. Create a Utility Class called HtmlUtilities which uses the HtmlAgilityPack.

  3. Use it: var plainText = HtmlUtilities.ConvertToPlainText(email.HtmlCode);

Here is the HtmlUtilities class as copied from the link above:

using HtmlAgilityPack;
using System;
using System.IO;

namespace ReadSharp
{
public class HtmlUtilities
{
/// <summary>
/// Converts HTML to plain text / strips tags.
/// </summary>
/// <param name="html">The HTML.</param>
/// <returns></returns>
public static string ConvertToPlainText(string html)
{
  HtmlDocument doc = new HtmlDocument();
  doc.LoadHtml(html);

  StringWriter sw = new StringWriter();
  ConvertTo(doc.DocumentNode, sw);
  sw.Flush();
  return sw.ToString();
}


/// <summary>
/// Count the words.
/// The content has to be converted to plain text before (using ConvertToPlainText).
/// </summary>
/// <param name="plainText">The plain text.</param>
/// <returns></returns>
public static int CountWords(string plainText)
{
  return !String.IsNullOrEmpty(plainText) ? plainText.Split(' ', '\n').Length : 0;
}


public static string Cut(string text, int length)
{
  if (!String.IsNullOrEmpty(text) && text.Length > length)
  {
    text = text.Substring(0, length - 4) + " ...";
  }
  return text;
}


private static void ConvertContentTo(HtmlNode node, TextWriter outText)
{
  foreach (HtmlNode subnode in node.ChildNodes)
  {
    ConvertTo(subnode, outText);
  }
}


private static void ConvertTo(HtmlNode node, TextWriter outText)
{
  string html;
  switch (node.NodeType)
  {
    case HtmlNodeType.Comment:
      // don't output comments
      break;

    case HtmlNodeType.Document:
      ConvertContentTo(node, outText);
      break;

    case HtmlNodeType.Text:
      // script and style must not be output
      string parentName = node.ParentNode.Name;
      if ((parentName == "script") || (parentName == "style"))
        break;

      // get text
      html = ((HtmlTextNode)node).Text;

      // is it in fact a special closing node output as text?
      if (HtmlNode.IsOverlappedClosingElement(html))
        break;

      // check the text is meaningful and not a bunch of whitespaces
      if (html.Trim().Length > 0)
      {
        outText.Write(HtmlEntity.DeEntitize(html));
      }
      break;

    case HtmlNodeType.Element:
      switch (node.Name)
      {
        case "p":
          // treat paragraphs as crlf
          outText.Write("\r\n");
          break;
        case "br":
          outText.Write("\r\n");
          break;
      }

      if (node.HasChildNodes)
      {
        ConvertContentTo(node, outText);
      }
      break;
  }
}
}
}
Greg Gum
1

There is no method named 'ConvertToPlainText' in HtmlAgilityPack, but you can convert an HTML string to a clean string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
// Keep the result: Regex.Replace returns a new string
textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", "");

That works for me, but I can't find a method named 'ConvertToPlainText' in HtmlAgilityPack.

nawfal
Amine
  • ok, this one is not good one - as you using additional library just to find document root node and then apply regex on whole root node? It is either you use HtmlAgilityPack to parse html node by node or use regex to process whole text as a whole. – Giedrius Feb 26 '21 at 13:51
1

I had the same question, just my html had a simple pre-known layout, like:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using such simple code:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Which outputs:

abc
def
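
A self-contained version of that one-liner might look like the sketch below. `XDocument.Parse` only accepts well-formed XML, so typical HTML such as an unclosed `<br>` makes it throw an `XmlException`; this sketch returns null in that case. The `XmlFragmentText` class and `TryExtract` method are illustrative names, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class XmlFragmentText
{
    // Returns the text of each child of the root element on its own line,
    // or null when the input is not well-formed XML.
    public static string TryExtract(string html)
    {
        try
        {
            var root = XDocument.Parse(html).Root;
            return string.Join(Environment.NewLine, root.Elements().Select(el => el.Value));
        }
        catch (XmlException)
        {
            return null;
        }
    }
}
```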
Karlas
1

I faced a similar problem and found a good solution. The code below works perfectly for me.

  private string ConvertHtml_Totext(string source)
    {
     try
      {
      string result;

    // Remove HTML Development formatting
    // Replace line breaks with space
    // because browsers inserts space
    result = source.Replace("\r", " ");
    // Replace line breaks with space
    // because browsers inserts space
    result = result.Replace("\n", " ");
    // Remove step-formatting
    result = result.Replace("\t", string.Empty);
    // Remove repeating spaces because browsers ignore them
    result = System.Text.RegularExpressions.Regex.Replace(result,
                                                          @"( )+", " ");

    // Remove the header (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*head([^>])*>","<head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*head( )*>)","</head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<head>).*(</head>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all scripts (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*script([^>])*>","<script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*script( )*>)","</script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    //result = System.Text.RegularExpressions.Regex.Replace(result,
    //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
    //         string.Empty,
    //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<script>).*(</script>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all styles (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*style([^>])*>","<style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*style( )*>)","</style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<style>).*(</style>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert tabs in spaces of <td> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*td([^>])*>","\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line breaks in places of <BR> and <LI> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*br( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*li( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line paragraphs (double line breaks) in place
    // if <P>, <DIV> and <TR> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*div([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*tr([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*p([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // Remove remaining tags like <a>, links, images,
    // comments etc - anything that's enclosed inside < >
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<[^>]*>",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // replace special characters:
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&nbsp;"," ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&bull;"," * ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lsaquo;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&rsaquo;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&trade;","(tm)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&frasl;","/",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lt;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&gt;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&copy;","(c)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&reg;","(r)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove all others. More can be added, see
    // http://hotwired.lycos.com/webmonkey/reference/special_characters/
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&(.{2,6});", string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // for testing
    //System.Text.RegularExpressions.Regex.Replace(result,
    //       this.txtRegex.Text,string.Empty,
    //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // make line breaking consistent
    result = result.Replace("\n", "\r");

    // Remove extra line breaks and tabs:
    // replace over 2 breaks with 2 and over 4 tabs with 4.
    // Prepare first to remove any whitespaces in between
    // the escaped characters and remove redundant tabs in between line breaks
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\t)","\t\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\r)","\t\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\t)","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove redundant tabs
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove multiple tabs following a line break with just one tab
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Initial replacement target string for line breaks
    string breaks = "\r\r\r";
    // Initial replacement target string for tabs
    string tabs = "\t\t\t\t\t";
    for (int index=0; index<result.Length; index++)
    {
        result = result.Replace(breaks, "\r\r");
        result = result.Replace(tabs, "\t\t\t\t");
        breaks = breaks + "\r";
        tabs = tabs + "\t";
    }

    // That's it.
    return result;
}
catch
{
    MessageBox.Show("Error");
    return source;
}

}

Escape characters such as \n and \r had to be removed first because they cause regexes to cease working as expected.

Moreover, to make the result string display correctly in a textbox, you might need to split it up and set the textbox's Lines property instead of assigning to its Text property.

this.txtResult.Lines = ConvertHtml_Totext(this.txtSource.Text).Split("\r".ToCharArray());

Source : https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

  • This worked almost perfectly for me. I required one small fix. This case was not resulting in a new line `
  • `. Simple tweak to the regex, I modified this `Regex.Replace(result, @"<( )*li( )*>", "\r"` to this `Regex.Replace(result, @"<( )*li( )*[^>]*>", "\r"`
  • – LorneCash Aug 03 '21 at 19:45