How do you convert Html to plain text?

Question

I have snippets of Html stored in a table. Not entire pages, no tags or the like, just basic formatting.

I would like to be able to display that Html as text only, no formatting, on a given page (actually just the first 30 - 50 characters but that's the easy bit).

How do I place the "text" within that Html into a string as straight text?

So this piece of code.

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Becomes:

Hello World. Is there anyone out there?

There are some good suggestions from the W3C here: http://www.w3.org/Tools/html2things.html — Rich, Jun 21 '12 at 15:09
There's some pretty simple and straight-forward code to convert HTML to plain text at http://www.blackbeltcoder.com/Articles/strings/convert-html-to-text. — Jonathan Wood, Apr 11 '11 at 15:30
You may want to use SgmlReader. http://code.msdn.microsoft.com/SgmlReader — Leonardo Herrera, Nov 13 '08 at 12:40
How can a question be marked as a duplicate of a question that was asked 6 months later? Seems a little backward... — Stuart Helwig, Jun 26 '13 at 02:52
I've written [a function that does convert HTML to plain text](http://pastebin.com/NswerNkQ). It has some limitations like e.g. not extracting links from `a` tags. I should better base my function on the [source code of PHP's html2text](https://github.com/soundasleep/html2text/blob/master/src/Html2Text.php). — Uwe Keim, Sep 23 '18 at 14:14

score 127 · Answer 1 · edited Feb 19 '21 at 18:17

The MIT-licensed HtmlAgilityPack includes, in one of its samples, a method that converts HTML to plain text. It is sample code rather than part of the library's API, so the class has to be copied into your project.

var plainText = HtmlUtilities.ConvertToPlainText(html);

Feed it an HTML string like

<b>hello, <i>world!</i></b>

And you'll get a plain text result like:

hello, world!

answered Jul 13 '09 at 19:17 by Judah Gabriel Himango · edited Feb 19 '21 at 18:17 by Prof. Falken

I have used HtmlAgilityPack before but I can't see any reference to ConvertToPlainText. Are you able to tell me where I can find it? – horatio Jan 08 '10 at 03:43
Horatio, it is included in one of the samples that come with HtmlAgilityPack: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179 – Judah Gabriel Himango Jan 08 '10 at 15:37
Before using this method, I advise checking whether it really does whitelisting internally. Otherwise, it is dangerous. – usr Aug 14 '11 at 19:39
Actually, there isn't a built-in method for this in the Agility Pack. What you linked to is an example which uses the Agility Pack to traverse the node tree, remove `script` and `style` tags and write the inner text of other elements into the output string. I doubt it has been tested much with real-world inputs. – Lou Sep 02 '12 at 12:19
Yep, it's in the samples. It uses core HTML Agility Pack functionality to parse the document and spit out the text of the nodes, while skipping over styles, scripts, and comments. See the code for yourself: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179 – Judah Gabriel Himango Dec 05 '12 at 22:42
You'll notice the code blacklists script blocks, comments, and styles. – Judah Gabriel Himango Feb 20 '13 at 20:15
Can somebody please provide code that works, as opposed to links to samples that need to be retrofitted to work properly? – Eric K Sep 17 '13 at 19:24
Did you look at the link I posted in the comments? http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772#52179 – Judah Gabriel Himango Sep 17 '13 at 19:50
The linked sample works nicely. For anyone struggling to use it, just copy the whole class into your own project and use the ConvertHtml method. You also need to download the HtmlAgilityPack dll and reference it in your project. – rdans Oct 16 '13 at 14:22
Everything is OK, but when I try to convert the text `&lt;script&gt;alert("Ups")&lt;/script&gt;` I get an XSS alert, because the HtmlEntity.DeEntitize() method converts `&lt;script` to `<script`. – Zabaa Feb 26 '14 at 14:16
The supplied link doesn't handle whitespace very well. An alternative is in the answer to the SO question at http://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c#25178738 – Brent Aug 10 '14 at 03:53
The sample can now be found here: https://github.com/ceee/ReadSharp/blob/master/ReadSharp/HtmlUtilities.cs – StuartQ Jul 13 '18 at 11:59
Good enough to convert the HTML fragment that a [RichTextEditor](https://richtexteditor.com) provides. – Mike Finch Jul 16 '21 at 22:44

score 79 · Answer 2 · answered May 06 '13 at 21:06

I could not use HtmlAgilityPack, so I wrote a second-best solution for myself:

using System;
using System.Text.RegularExpressions;

private static string HtmlToPlainText(string html)
{
    // Matches one or more whitespace characters (including line breaks) between '>' and '<'
    const string tagWhiteSpace = @">\s+<";
    // Matches any characters between '<' and '>', even when the closing '>' is missing
    const string stripFormatting = @"<[^>]*(>|$)";
    // Matches <br>, <br/>, <br />, <BR>, <BR/>, <BR />
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    // Decode HTML entities such as &amp; and &lt;
    text = System.Net.WebUtility.HtmlDecode(text);
    // Remove whitespace and line breaks between adjacent tags
    text = tagWhiteSpaceRegex.Replace(text, "><");
    // Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    // Strip all remaining tags
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}
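A note on ordering, as a minimal sketch: the method above decodes entities before it strips tags, so text that was written as entities (for example &lt;b&gt;) turns into a real tag and is then removed. The class and method names below are hypothetical; only the tag-stripping pattern is taken from the answer. It compares the two orderings side by side.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the answer above.
public static class DecodeOrderSketch
{
    // Same tag-stripping pattern as the answer: a tag, or an unclosed tag running to the end.
    private static readonly Regex Tag = new Regex(@"<[^>]*(>|$)");

    // Decode first, then strip: entity-encoded "<...>" text becomes a tag and is lost.
    public static string DecodeThenStrip(string html) =>
        Tag.Replace(WebUtility.HtmlDecode(html), string.Empty);

    // Strip first, then decode: entity-encoded text survives as literal characters.
    public static string StripThenDecode(string html) =>
        WebUtility.HtmlDecode(Tag.Replace(html, string.Empty));
}
```

With the input `<p>Use the &lt;b&gt; tag</p>`, the first ordering drops the `<b>` and the second keeps it. If the result is later written back into an HTML page, it would need to be encoded again.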

answered May 06 '13 at 21:06 by Ben Anderson

<blabla> was parsed out, so I moved the text = System.Net.WebUtility.HtmlDecode(text); line to the bottom of the method. – Luuk Aug 20 '14 at 10:34
This was great. I also added a multi-space condenser, as the HTML might have been generated by a CMS: var spaceRegex = new Regex("[ ]{2,}", RegexOptions.None); – Enkode Apr 03 '16 at 08:02
Sometimes the HTML source contains line breaks added by the coder (a line break can't be shown in a comment, so I write it as [new line]), like: I [new line] miss [new line] you. It is supposed to show "I miss you", but it shows I [new line] miss [new line] you. This makes the plain text look painful. Do you know how to fix it? – 123iamking Jun 07 '16 at 07:24
@123iamking you can use this before return text; : text.Replace("[new line]", "\n"); – Eslam Badawy Oct 20 '18 at 16:51
I was using this and realized that sometimes it leaves '>' at the beginning of the strings. The other solution of applying the regex <[^>]*> works fine. – Etienne Charland Mar 04 '19 at 16:51
Regex is in System.Text.RegularExpressions. – Eric Barr Jun 28 '19 at 23:11
Am I the only one here who thinks that regexes are not to be used for parsing structured languages like HTML? https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Mladen B. Jun 04 '21 at 07:23

vfilby · Accepted Answer · 2009-07-13T19:15:25.077

32

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a context-free grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.
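To make the <script> caveat concrete, here is a small sketch (the class and method names are made up for illustration). It applies the pattern above unchanged and shows that the tags disappear while the script body stays in the output as ordinary text.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class NaiveStripSketch
{
    // Removes every "<...>" run; it knows nothing about what the tags mean.
    public static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]*>", string.Empty);
}
```

For the snippet in the question this gives `Hello World.Is there anyone out there?` (the tags go, but no space is inserted where `<br/>` and `<p>` were). For `<p>Hi</p><script>alert(1)</script>` it gives `Hialert(1)`, because the script source is just text between two tags as far as the pattern is concerned.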

If you can use regular expressions, there are many web pages out there with good info.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately, I don't know of a good one to recommend.

answered Nov 13 '08 at 12:44 by vfilby · edited Jul 13 '09 at 19:15

You also have to worry about > in attribute values, comments, PIs/CDATA in XML and various common malformednesses in legacy HTML. In general [X][HT]ML is not amenable to parsing with regexps. – bobince Nov 13 '08 at 12:58
You can accommodate the > in attribute values by making attributes a part of the regular expression. It is only the complexity of nested tags that limits the usefulness of parsing with regular expressions. – vfilby Nov 16 '08 at 15:33
Don't you mean <[^>]*>, which matches whole tags, and not <[^>]>*, which matches things like >>>? – Greg Jun 30 '09 at 13:16
This is a terrible way to do it. The correct way is to parse the HTML with a lib and to traverse the DOM, outputting only whitelisted content. – usr May 26 '11 at 18:09
@usr: The part you are referring to is the CFG part of the answer. Regex can be used for quick and dirty tag stripping; it has its weaknesses, but it is quick and it is easy. For more complicated parsing use a CFG-based tool (in your parlance, a lib that generates a DOM). I haven't performed the tests, but I'd wager that DOM parsing is slower than regex stripping, in case performance needs to be considered. – vfilby May 29 '11 at 22:59
@vfilby: NO! Tag stripping is blacklisting. Just as an example of what you forgot: your regex will not strip tags which are missing the closing '>'. Did you think of that? I am not sure if this can be a problem, but it proves at least that you missed this case. Who knows what else you missed. Here is another one: you miss images with a javascript src attribute. NEVER do blacklisting unless security is not important. – usr May 31 '11 at 12:39
@usr: The part you are referring to is the CFG part of the answer. Regex can be used for quick and dirty tag stripping; it has its weaknesses, but it is quick and it is easy. For more complicated parsing use a CFG-based tool (in your parlance, a lib that generates a DOM). – vfilby Aug 09 '11 at 21:26
@vfilby, the first attack that comes to mind is writing " – usr Aug 14 '11 at 19:38
@usr Improper input can just as easily confuse a parsing library (what I refer to as a CFG), so I am not sure that is the best evidence against tag stripping. Nested tags are not supported by regex tag stripping, as noted in my answer (i.e. – vfilby Jun 21 '12 at 03:43
@vfilby, it doesn't matter if the parsing lib is confused or not. All you need to do is take the DOM from it (any DOM at all) and output only whitelisted components. This is always safe; it does not matter what the parsed DOM looks like. Also, I gave you multiple examples where your "simple" method will fail to remove tags. – usr Jun 21 '12 at 13:34
As @bobince wrote, HTML is not amenable to parsing with regular expressions. This will blow up on real-world HTML, which is often malformed. – Judah Gabriel Himango Dec 05 '12 at 22:38
Not the proper way. It only removes tags; HTML text may contain line breaks, tabs and other formatting which this regex doesn't remove. – Amir Dora. Sep 23 '20 at 18:27

George Stocker · Answer 4 · 2015-06-09T15:19:22.013

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

The HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
  string s,
  TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
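Note that encoding keeps the tags visible as text rather than removing them. A minimal sketch of what that looks like, assuming System.Net.WebUtility is available (it is used here instead of Server.HtmlEncode so that the example does not depend on System.Web; the class and method names are hypothetical):

```csharp
using System.Net;

// Hypothetical helper for illustration only.
public static class EncodeSketch
{
    // Turns markup characters into entities so a browser shows them literally.
    public static string Encode(string text) => WebUtility.HtmlEncode(text);

    // Reverses the encoding.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);
}
```

Encoding `This is a <Test String>.` yields `This is a &lt;Test String&gt;.`, and decoding that returns the original string.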

A really good answer George thanks, it also highlighted how poorly I asked the question first time around. Sorry. — Stuart Helwig, Nov 14 '08 at 00:38

score 11 · Answer 5 · answered Oct 11 '17 at 06:38

Three-step process for converting HTML into plain text

First, install the HtmlAgilityPack NuGet package. Second, create this class:

using System.IO;
using HtmlAgilityPack;

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

The class above is based on the sample referenced in Judah Himango's answer.

Third, create an instance of the class and call its ConvertHtml(HTMLContent) method to convert the HTML into plain text, rather than calling ConvertToPlainText(html):

HtmlToText htt = new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

Can I skip converting links in the HTML? I need to keep the links when converting to text. – coder771, Apr 06 '18 at 08:03

score 8 · Answer 6 · answered Mar 11 '11 at 18:14

To add to vfilby's answer, you can just perform a Regex replace within your code; no new classes are necessary. This is in case other newbies like myself stumble upon this question.

using System.Text.RegularExpressions;

Then...

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}

NOT GOOD! This can be tricked into letting script through by omitting the closing angle bracket. GUYS, never do blacklisting. You _cannot_ sanitize input by blacklisting. This is so wrong. – usr, May 26 '11 at 18:11
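The case raised in the comment above can be reproduced with a short sketch. The class and method names are hypothetical; the first pattern is the one from this answer and the second is the `(>|$)` variant that appears in other answers on this page.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class UnclosedTagSketch
{
    // Only matches tags that have a closing '>'.
    public static string StripClosedOnly(string html) =>
        Regex.Replace(html, "<[^>]*>", string.Empty);

    // Also matches a tag that runs to the end of the input without a closing '>'.
    public static string StripAlsoUnclosed(string html) =>
        Regex.Replace(html, "<[^>]*(>|$)", string.Empty);
}
```

For the input `Hello <img src=x onerror=alert(1)` the first method returns the string unchanged, while the second returns `Hello ` (with the trailing space). Neither makes the output safe to embed in a page by itself; encoding the text at output time is still needed.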

jeiea · Answer 7 · 2018-05-16T05:37:50.153

Its limitation is that it does not collapse long runs of inline whitespace, but it is portable and respects block layout much as a web browser does.

using System.Net;
using System.Text.RegularExpressions;

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}

@Prof.Falken I admit it. I think every piece of code has pros and cons. Its con is robustness, and its pro may be simplicity (in terms of SLOC). You may post code using `XDocument`. – jeiea, Feb 20 '21 at 19:26
This is the most reliable solution because it works on actual HTML tags and not on anything that merely looks like one. While testing HTML mailings, this was the perfect solution. I changed "\n" to Environment.NewLine and added return buf.Trim(); at the end for my needs. Great one, this should be the best answer. – Tanner Ornelas, Apr 03 '22 at 00:06

score 5 · Answer 8 · answered May 08 '12 at 21:52

I think the easiest way is to make a 'string' extension method (based on what user Richard has suggested):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
    {
        var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        return reg.Replace(HTMLText, "");
    }
}

Then just use this extension method on any 'string' variable in your program:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></div>";
var yourTextString = yourHtmlString.StripHTML();

I use this extension method to convert HTML-formatted comments to plain text so they display correctly on a Crystal Report, and it works perfectly!

score 3 · Answer 9 · answered Apr 19 '17 at 12:48

The simplest way I found:

HtmlFilter.ConvertToPlainText(html);

The HtmlFilter class is located in Microsoft.TeamFoundation.WorkItemTracking.Controls.dll

The dll can be found in a folder like this: %ProgramFiles%\Common Files\microsoft shared\Team Foundation Server\14.0\

In VS 2015, the dll also requires reference to Microsoft.TeamFoundation.WorkItemTracking.Common.dll, located in the same folder.

answered Apr 19 '17 at 12:48 by Roman O

Does it take care of script tags, and does it handle formatting such as bold, italic, etc.? – Samra May 09 '17 at 01:43
Introducing a Team Foundation dependency for converting HTML to plain text is very questionable... – ViRuSTriNiTy Mar 17 '20 at 09:00

score 3 · Answer 10 · answered Feb 10 '23 at 16:12

An updated answer for 2023. The answer is basically the same as always:

1. Install the latest HtmlAgilityPack.
2. Create a utility class called HtmlUtilities which uses the HtmlAgilityPack.
3. Use it: var plainText = HtmlUtilities.ConvertToPlainText(email.HtmlCode);

Here is the HtmlUtilities class as copied from the link above:

using HtmlAgilityPack;
using System;
using System.IO;

namespace ReadSharp
{
public class HtmlUtilities
{
/// <summary>
/// Converts HTML to plain text / strips tags.
/// </summary>
/// <param name="html">The HTML.</param>
/// <returns></returns>
public static string ConvertToPlainText(string html)
{
  HtmlDocument doc = new HtmlDocument();
  doc.LoadHtml(html);

  StringWriter sw = new StringWriter();
  ConvertTo(doc.DocumentNode, sw);
  sw.Flush();
  return sw.ToString();
}


/// <summary>
/// Count the words.
/// The content has to be converted to plain text before (using ConvertToPlainText).
/// </summary>
/// <param name="plainText">The plain text.</param>
/// <returns></returns>
public static int CountWords(string plainText)
{
  return !String.IsNullOrEmpty(plainText) ? plainText.Split(' ', '\n').Length : 0;
}


public static string Cut(string text, int length)
{
  if (!String.IsNullOrEmpty(text) && text.Length > length)
  {
    text = text.Substring(0, length - 4) + " ...";
  }
  return text;
}


private static void ConvertContentTo(HtmlNode node, TextWriter outText)
{
  foreach (HtmlNode subnode in node.ChildNodes)
  {
    ConvertTo(subnode, outText);
  }
}


private static void ConvertTo(HtmlNode node, TextWriter outText)
{
  string html;
  switch (node.NodeType)
  {
    case HtmlNodeType.Comment:
      // don't output comments
      break;

    case HtmlNodeType.Document:
      ConvertContentTo(node, outText);
      break;

    case HtmlNodeType.Text:
      // script and style must not be output
      string parentName = node.ParentNode.Name;
      if ((parentName == "script") || (parentName == "style"))
        break;

      // get text
      html = ((HtmlTextNode)node).Text;

      // is it in fact a special closing node output as text?
      if (HtmlNode.IsOverlappedClosingElement(html))
        break;

      // check the text is meaningful and not a bunch of whitespaces
      if (html.Trim().Length > 0)
      {
        outText.Write(HtmlEntity.DeEntitize(html));
      }
      break;

    case HtmlNodeType.Element:
      switch (node.Name)
      {
        case "p":
          // treat paragraphs as crlf
          outText.Write("\r\n");
          break;
        case "br":
          outText.Write("\r\n");
          break;
      }

      if (node.HasChildNodes)
      {
        ConvertContentTo(node, outText);
      }
      break;
  }
}
}
}

score 1 · Answer 11 · edited Apr 26 '13 at 13:58

There is no method named 'ConvertToPlainText' in HtmlAgilityPack itself, but you can convert an HTML string to a clean string with:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
// Remove any leftover tags, then replace non-breaking-space entities with a plain space
textString = Regex.Replace(textString, @"<(.|\n)*?>", string.Empty).Replace("&nbsp;", " ");

That works for me, but I can't find a method named 'ConvertToPlainText' in HtmlAgilityPack.

answered Mar 26 '13 at 09:22 by Amine · edited Apr 26 '13 at 13:58 by nawfal

OK, this one is not a good one: you are using an additional library just to find the document root node and then applying a regex to the whole root node? Either use HtmlAgilityPack to parse the HTML node by node, or use a regex to process the text as a whole. – Giedrius Feb 26 '21 at 13:51

score 1 · Answer 12 · answered Apr 17 '18 at 13:45

I had the same question, just my html had a simple pre-known layout, like:

<DIV><P>abc</P><P>def</P></DIV>

So I ended up using such simple code:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Which outputs:

abc
def
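A runnable version of the one-liner, as a sketch with hypothetical class and method names. The main assumption to be aware of is that XDocument.Parse only accepts well-formed XML with a single root element, so typical HTML fragments (an unclosed <br>, or several top-level elements) throw an XmlException; the second method makes that failure explicit.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class XDocumentSketch
{
    // Joins the text of each child element of the root, one per line.
    public static string ChildTexts(string html) =>
        string.Join(Environment.NewLine,
            XDocument.Parse(html).Root.Elements().Select(el => el.Value));

    // Returns false instead of throwing when the input is not well-formed XML.
    public static bool TryChildTexts(string html, out string text)
    {
        try
        {
            text = ChildTexts(html);
            return true;
        }
        catch (XmlException)
        {
            text = null;
            return false;
        }
    }
}
```

This is only a fit when the markup is known in advance to be well-formed, as in the layout above.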

score 1 · Answer 13 · answered Oct 17 '18 at 19:54

I faced a similar problem and found the following solution. The code below works perfectly for me.

  private string ConvertHtml_Totext(string source)
    {
     try
      {
      string result;

    // Remove HTML Development formatting
    // Replace line breaks with space
    // because browsers inserts space
    result = source.Replace("\r", " ");
    // Replace line breaks with space
    // because browsers inserts space
    result = result.Replace("\n", " ");
    // Remove step-formatting
    result = result.Replace("\t", string.Empty);
    // Remove repeating spaces because browsers ignore them
    result = System.Text.RegularExpressions.Regex.Replace(result,
                                                          @"( )+", " ");

    // Remove the header (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*head([^>])*>","<head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*head( )*>)","</head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<head>).*(</head>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all scripts (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*script([^>])*>","<script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*script( )*>)","</script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    //result = System.Text.RegularExpressions.Regex.Replace(result,
    //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
    //         string.Empty,
    //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<script>).*(</script>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all styles (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*style([^>])*>","<style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*style( )*>)","</style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<style>).*(</style>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert tabs in spaces of <td> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*td([^>])*>","\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line breaks in places of <BR> and <LI> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*br( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*li( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line paragraphs (double line breaks) in place
    // if <P>, <DIV> and <TR> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*div([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*tr([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*p([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // Remove remaining tags like <a>, links, images,
    // comments etc - anything that's enclosed inside < >
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<[^>]*>",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // replace special characters:
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&nbsp;"," ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&bull;"," * ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lsaquo;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&rsaquo;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&trade;","(tm)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&frasl;","/",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lt;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&gt;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&copy;","(c)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&reg;","(r)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove all others. More can be added, see
    // http://hotwired.lycos.com/webmonkey/reference/special_characters/
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&(.{2,6});", string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // for testing
    //System.Text.RegularExpressions.Regex.Replace(result,
    //       this.txtRegex.Text,string.Empty,
    //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // make line breaking consistent
    result = result.Replace("\n", "\r");

    // Remove extra line breaks and tabs:
    // replace over 2 breaks with 2 and over 4 tabs with 4.
    // Prepare first to remove any whitespaces in between
    // the escaped characters and remove redundant tabs in between line breaks
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\t)","\t\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\r)","\t\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\t)","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove redundant tabs
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove multiple tabs following a line break with just one tab
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Initial replacement target string for line breaks
    string breaks = "\r\r\r";
    // Initial replacement target string for tabs
    string tabs = "\t\t\t\t\t";
    for (int index=0; index<result.Length; index++)
    {
        result = result.Replace(breaks, "\r\r");
        result = result.Replace(tabs, "\t\t\t\t");
        breaks = breaks + "\r";
        tabs = tabs + "\t";
    }

    // That's it.
    return result;
}
catch
{
    MessageBox.Show("Error");
    return source;
}

}

Escape characters such as \n and \r had to be removed first because they cause regexes to cease working as expected.

Moreover, to make the result string display correctly in the textbox, one might need to split it up and set textbox's Lines property instead of assigning to Text property.

this.txtResult.Lines = ConvertHtml_Totext(this.txtSource.Text).Split("\r".ToCharArray());

Source : https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

This worked almost perfectly for me. I required one small fix. This case was not resulting in a new line `

score 0 · Answer 14 · answered Nov 13 '08 at 12:41

0

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility.HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.
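A minimal sketch of the first and third cases, assuming System.Net.WebUtility as a stand-in for HttpServerUtility (it offers the same HtmlEncode/HtmlDecode operations without a System.Web dependency). The class and method names are hypothetical. The regex version only removes tags for display purposes; it is not a sanitizer and should not be relied on to make untrusted HTML safe.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlDisplayDemo
{
    // Case 1: make the markup itself visible by encoding it.
    public static string ShowTags(string html)
    {
        return WebUtility.HtmlEncode(html);
    }

    // Case 3: naive tag stripping for display only, then entity decoding.
    public static string StripTags(string html)
    {
        string withoutTags = Regex.Replace(html, "<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

With this sketch, ShowTags("<b>Hi</b>") yields the encoded markup, while StripTags("<b>Hi &amp; bye</b>") yields the text with the entity decoded.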

answered Nov 13 '08 at 12:41

Corey Trager


In PHP there is a function called strip_tags(); maybe you have something similar – markus Nov 13 '08 at 22:46
2

"use a regular expression" NO! This would be blacklisting. You can only be safe doing whitelisting. For example, would you have remembered that the style attribute can contain "background: url('javascript:...');"? Of course not, and I would not have either. That's why blacklisting does not work. – usr May 26 '11 at 18:34

score 0 · Answer 15 · answered Nov 13 '08 at 12:46

0

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

answered Nov 13 '08 at 12:46

mpez0


as he said "I have snippets of Html stored in a table. " – M at Jun 22 '16 at 17:20

Mehdi Dehghani · Answer 16 · 2021-02-23T13:34:16.907

-1

Here is my solution:

public string StripHTML(string html)
{
    if (string.IsNullOrWhiteSpace(html)) return "";

    // could be stored in static variable
    var regex = new Regex("<[^>]+>|\\s{2}", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode(regex.Replace(html, ""));
}

Example:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:
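The "could be stored in static variable" remark can be sketched as follows. This is a variation rather than the code above: it assumes System.Net.WebUtility in place of System.Web.HttpUtility, and it collapses runs of whitespace into a single space instead of deleting them, so that words separated by two spaces are not joined. The class name HtmlStripper is hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Built once and reused across calls.
    private static readonly Regex TagRegex =
        new Regex("<[^>]+>", RegexOptions.Compiled);
    private static readonly Regex WhitespaceRegex =
        new Regex(@"\s{2,}", RegexOptions.Compiled);

    public static string StripHtml(string html)
    {
        if (string.IsNullOrWhiteSpace(html)) return "";

        string withoutTags = TagRegex.Replace(html, "");
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // Collapse whitespace runs to one space rather than removing them.
        return WhitespaceRegex.Replace(decoded, " ").Trim();
    }
}
```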

edited Feb 23 '21 at 13:34

answered Nov 03 '17 at 09:25

Mehdi Dehghani


score -1 · Answer 17 · answered Apr 29 '18 at 10:16

I did not write this, but I am using it:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
  //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
  public static class HtmlToText {

    public static string Convert(string path) {
      HtmlDocument doc = new HtmlDocument();
      doc.Load(path);
      return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html) {
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc) {
      using (StringWriter sw = new StringWriter()) {
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
      }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      foreach (HtmlNode subnode in node.ChildNodes) {
        ConvertTo(subnode, outText, textInfo);
      }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText) {
      ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      string html;
      switch (node.NodeType) {
        case HtmlNodeType.Comment:
          // don't output comments
          break;
        case HtmlNodeType.Document:
          ConvertContentTo(node, outText, textInfo);
          break;
        case HtmlNodeType.Text:
          // script and style must not be output
          string parentName = node.ParentNode.Name;
          if ((parentName == "script") || (parentName == "style")) {
            break;
          }
          // get text
          html = ((HtmlTextNode)node).Text;
          // is it in fact a special closing node output as text?
          if (HtmlNode.IsOverlappedClosingElement(html)) {
            break;
          }
          // check the text is meaningful and not a bunch of whitespaces
          if (html.Length == 0) {
            break;
          }
          if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
            html = html.TrimStart();
            if (html.Length == 0) { break; }
            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
          }
          outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
          if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
            outText.Write(' ');
          }
          break;
        case HtmlNodeType.Element:
          string endElementString = null;
          bool isInline;
          bool skip = false;
          int listIndex = 0;
          switch (node.Name) {
            case "nav":
              skip = true;
              isInline = false;
              break;
            case "body":
            case "section":
            case "article":
            case "aside":
            case "h1":
            case "h2":
            case "header":
            case "footer":
            case "address":
            case "main":
            case "div":
            case "p": // stylistic - adjust as you tend to use
              if (textInfo.IsFirstTextOfDocWritten) {
                outText.Write("\r\n");
              }
              endElementString = "\r\n";
              isInline = false;
              break;
            case "br":
              outText.Write("\r\n");
              skip = true;
              textInfo.WritePrecedingWhiteSpace = false;
              isInline = true;
              break;
            case "a":
              if (node.Attributes.Contains("href")) {
                string href = node.Attributes["href"].Value.Trim();
                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                  endElementString = "<" + href + ">";
                }
              }
              isInline = true;
              break;
            case "li":
              if (textInfo.ListIndex > 0) {
                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
              } else {
                outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
              }
              isInline = false;
              break;
            case "ol":
              listIndex = 1;
              goto case "ul";
            case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
              endElementString = "\r\n";
              isInline = false;
              break;
            case "img": //inline-block in reality
              if (node.Attributes.Contains("alt")) {
                outText.Write('[' + node.Attributes["alt"].Value);
                endElementString = "]";
              }
              if (node.Attributes.Contains("src")) {
                outText.Write('<' + node.Attributes["src"].Value + '>');
              }
              isInline = true;
              break;
            default:
              isInline = true;
              break;
          }
          if (!skip && node.HasChildNodes) {
            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
          }
          if (endElementString != null) {
            outText.Write(endElementString);
          }
          break;
      }
    }
  }
  internal class PreceedingDomTextInfo {
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
      IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
  }
  internal class BoolWrapper {
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper) {
      return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper) {
      return new BoolWrapper { Value = boolWrapper };
    }
  }
}

score -1 · Answer 18 · edited Aug 05 '19 at 08:39

-1

I think it has a simple answer:

public string RemoveHTMLTags(string HTMLCode)
{
    string str=System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
    return str;
}
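A quick check of this approach against the markup from the question shows two limits worth knowing about: no space is inserted where a tag is removed, and entities are left encoded. The wrapper class RemoveTagsDemo is hypothetical; the regex is the one from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RemoveTagsDemo
{
    // Same pattern as above: drop anything that looks like a tag.
    public static string RemoveHtmlTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", "");
    }
}
```

For the question's sample, "<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>", this returns "Hello World.Is there anyone out there?" with the two sentences run together, and "<p>Fish &amp; chips</p>" comes back as "Fish &amp; chips".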

edited Aug 05 '19 at 08:39

Nick


answered Aug 05 '19 at 08:18

user3077654


score -1 · Answer 19 · answered Jan 24 '20 at 09:49

For anyone looking for an exact solution to the OP's question, i.e. a textual abbreviation of a given HTML document without newlines and HTML tags, please find the solution below.

Like with every proposed solution, there are some assumptions with the code below:

script and style elements must not contain script or style tags as part of their own content
only the major inline elements are removed without inserting a space, i.e. he<span>ll</span>o should output hello. List of inline tags: https://www.w3schools.com/htmL/html_blocks.asp

Considering the above, the following string extension with compiled regular expressions outputs the expected plain text, decodes HTML-escaped characters, and returns null on null input.

using System.Text.RegularExpressions;
using System.Web;

public static class StringExtensions
{
    public static string ConvertToPlain(this string html)
    {
        if (html == null)
        {
            return html;
        }

        html = scriptRegex.Replace(html, string.Empty);
        html = inlineTagRegex.Replace(html, string.Empty);
        html = tagRegex.Replace(html, " ");
        html = HttpUtility.HtmlDecode(html);
        html = multiWhitespaceRegex.Replace(html, " ");

        return html.Trim();
    }

    private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
}
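The question also asks for only the first 30 to 50 characters. A minimal sketch of that last step, applied to text that has already been converted to plain text, might look like this. The Abbreviate helper and its "..." suffix are assumptions for illustration and are not part of the answer above.

```csharp
using System;

public static class PlainTextExtensions
{
    // Returns at most maxLength characters of text, followed by "..." when truncated.
    public static string Abbreviate(this string text, int maxLength)
    {
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (text == null || text.Length <= maxLength) return text;

        int cut = maxLength;
        // Avoid cutting between the two halves of a surrogate pair.
        if (cut > 0 && char.IsHighSurrogate(text[cut - 1])) cut--;

        return text.Substring(0, cut).TrimEnd() + "...";
    }
}
```

For example, "Hello World. Is there anyone out there?".Abbreviate(11) gives "Hello World...", and a string shorter than the limit is returned unchanged.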

José Leal · Answer 20 · 2008-11-13T12:44:18.767

-6

public static string StripTags2(string html) { return html.Replace("<", "&lt;").Replace(">", "&gt;"); }

By this you escape all "<" and ">" in a string. Is this what you want?

edited Nov 13 '08 at 12:44

answered Nov 13 '08 at 12:37

José Leal


...ah. Well now the answer (along with interpretation of the ambiguous question) has completely changed, I'll pick nits at the lack of &amp; encoding instead. ;-) – bobince Nov 13 '08 at 12:50
2

I don't think it is a good idea to reinvent the wheel - especially when your wheel is square. You should use HTMLEncode instead. – Kramii Nov 13 '08 at 15:28
