How can I Convert HTML to Text in C#?

Question

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt at W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?

EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML-to-text conversion)! All it did was strip the tags, flatten the tables, etc. The output didn't look anything like what Html2Txt @ W3C produced. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.

EDIT 2: Thank you, everybody, for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process vs. doing it in code. Plus, Lynx is FAST!!
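
The approach described in EDIT 2 could look roughly like the sketch below. It is not the asker's code (which was never posted); the class name LynxDump, the method names, and the way the target is quoted are assumptions for illustration. It assumes a Lynx executable is installed and that the target is a file path or URL without embedded double quotes.

```csharp
using System;
using System.Diagnostics;

public static class LynxDump
{
    // Builds the process settings described in the question:
    // no shell, stdout redirected, "-dump" switch.
    // lynxPath and target are supplied by the caller (hypothetical values in the usage note).
    public static ProcessStartInfo BuildStartInfo(string lynxPath, string target)
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + target + "\"",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
    }

    // Runs Lynx and returns whatever it wrote to standard output.
    public static string Run(string lynxPath, string target)
    {
        using (Process process = Process.Start(BuildStartInfo(lynxPath, target)))
        {
            // Read before waiting so a full pipe buffer cannot block the child process.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call such as LynxDump.Run("lynx.exe", "page.html") would then return the rendered text; both arguments are placeholders.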

Why doesn't the Html Agility Pack meet your needs? Might help direct people to your specific requirement. — dommer, Apr 08 '09 at 20:29
I haven't looked at it in detail, maybe it would work? Can you point me to a code sample somewhere? — Matt Crouch, Apr 08 '09 at 20:35
Matt did you ever write this code? Would love to see the result. — Matthew, Jan 15 '10 at 22:30
I'll post it soon (got a day off this week, and this isn't too tough). Enough folks like this question, which i'm happy about! — Matt Crouch, Jan 20 '10 at 18:38
Hi Matt, did you manage to wrap lynx in a c# class - i'm faced with the same requirements & dont want to go re-inventing the wheel as it were. — HeavenCore, Oct 29 '12 at 11:27

score 54 · Answer 1 · edited Jun 29 '18 at 11:02

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as noted by the OP, does not handle whitespace the way anyone writing HTML would expect. There are full text-rendering solutions out there, mentioned in other answers to this question; this is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.

using System; // StringComparison
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

//small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
public static class HtmlToText
{

    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc (HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;
            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html= html.TrimStart();
                    if (html.Length == 0) { break; }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
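                // Remember whether this text ended in whitespace, and emit a single space if so.
                textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]);
                if (textInfo.LastCharWasSpace)
                {
                    outText.Write(' ');
                }
                break;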
            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase)==-1)
                            {
                                endElementString =  "<" + href + ">";
                            }  
                        }
                        isInline = true;
                        break;
                    case "li": 
                        if(textInfo.ListIndex>0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++); 
                        }
                        else
                        {
                            outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                        }
                        isInline = false;
                        break;
                    case "ol": 
                        listIndex = 1;
                        goto case "ul";
                    case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": //inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten){ ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}
internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace {get;set;}
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}
internal class BoolWrapper
{
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper{ Value = boolWrapper };
    }
}

As an example, the following HTML code...

<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>

...will be transformed into:

Whatever Inc. 


Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: 

1.  Please confirm this is your email by replying. 
2.  Then perform this step. 

Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please: 

*   a point. 
*   another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>. 

Sincerely, 

The whatever.com team 


Ph: 000 000 000
mail: whatever st

...as opposed to:

        Whatever Inc.


            Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

                Please confirm this is your email by replying.

                Then perform this step.


            Please solve this . Then, in any order, could you please:

                a point.

                another point, with a hyperlink.


            Sincerely,


            The whatever.com team

        Ph: 000 000 000
        mail: whatever st

You can also handle `node.name` `tr` new line and `td` space to improve formatting within tables. — rboy, Sep 11 '16 at 02:17
can you post a code which returns formatted output string i.e. same amount of new line space etc. — Lucifer, Jan 11 '17 at 11:13
There's a sketchy bit in the code as currently shown -- `if(textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))` (note single `=`) appears to be side-effecting an assignment inside the `if()` statement. It's a code smell that makes it difficult to infer intent vs. typo. — Eric Lloyd, Sep 23 '21 at 21:11

score 39 · Answer 2 · edited Apr 22 '15 at 07:52

You could use this:

public static string StripHTML(string HTMLText, bool decode = true)
{
    // Requires System.Text.RegularExpressions and System.Web (for HttpUtility).
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}

Updated

Thanks for the comments; I have updated the function to improve it.
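
Whether to decode before or after stripping matters, and a small sketch may make that concrete. The class and method names below are made up for illustration, and System.Net.WebUtility is swapped in for HttpUtility so the sample has no dependency on System.Web. It compares the two orders on text that contains encoded angle brackets.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class StripOrderDemo
{
    private static readonly Regex Tag = new Regex("<[^>]+>");

    // Same order as the answer's function: remove tags, then decode entities.
    public static string StripThenDecode(string html)
    {
        return WebUtility.HtmlDecode(Tag.Replace(html, ""));
    }

    // Reverse order: decoded "&lt;" and "&gt;" turn into real angle brackets,
    // which the tag regex then mistakes for markup.
    public static string DecodeThenStrip(string html)
    {
        return Tag.Replace(WebUtility.HtmlDecode(html), "");
    }
}
```

For the input "<p>1 &lt; 2 and 3 &gt; 2</p>", StripThenDecode returns "1 < 2 and 3 > 2", while DecodeThenStrip loses the middle of the sentence and returns "1  2".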

edited Apr 22 '15 at 07:52

answered Apr 08 '09 at 22:20

Richard

This is incomplete... for example, it does not account for entities like `&nbsp;` etc. – Riko Mar 23 '12 at 13:34
It was awesome, and would be even better combined with **HtmlDecode**, I mean: `HTMLText = HttpUtility.HtmlDecode(HTMLText);` – Søren Apr 10 '12 at 19:49
This is actually a great example! I used this in my web application. All of our content is stored as HTML in a database. A more direct example is using it like this: `string test = HttpUtility.HtmlDecode(StripHTML(htmlText));` – Steven Combs Sep 05 '12 at 17:05
If not in a web project, you could also try System.Net.WebUtility.HtmlDecode() – Roman Gudkov Apr 20 '16 at 09:12
How can we adapt this solution to retain tab structure? – Charles Okwuagwu May 22 '16 at 18:38
If you want to use WebUtility in a Portable Class Library, you can use this NuGet package: https://www.nuget.org/packages/PCLWebUtility/ – chenk Nov 03 '16 at 07:39
That worked so easily. Thank you! Way better than most other SO answers – Fandango68 Nov 15 '16 at 07:29

score 17 · Answer 3 · edited May 23 '17 at 12:18

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

Some sample on SO..

HTML Agility pack - parsing tables

edited May 23 '17 at 12:18

Community

answered Apr 08 '09 at 20:33

madcolor

Good Luck.. I don't pretend that it's going to be easy.. but I think it's the correct path to go down. – madcolor Apr 08 '09 at 21:02

score 12 · Accepted Answer · edited Jan 23 '12 at 16:12

What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers. This is much harder to do than you would expect.

edited Jan 23 '12 at 16:12

Abel

answered Apr 08 '09 at 20:26

FlySwat

Nope, it actually makes it easier!! (see question edit). Thanks again! – Matt Crouch Apr 09 '09 at 16:39
@MattCrouch how does it make it easier? The Edit 2 answer in the original question is barely more than a hack - completely unacceptable to me and I suspect almost anyone's situation - would you acknowledge this? – PandaWood Nov 19 '14 at 23:56

score 4 · Answer 5 · answered Oct 18 '15 at 07:43

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating it.

Instead I used this utility from the Microsoft Team Foundation API:

var text = HtmlFilter.ConvertToPlainText(htmlContent);

answered Oct 18 '15 at 07:43

Nir

score 3 · Answer 6 · answered Jun 07 '12 at 02:09

Assuming you have well-formed HTML, you could also try an XSL transform.

Here's an example:

using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    private static XslCompiledTransform _transformer;
    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                // method="text" writes the concatenated text nodes without HTML escaping
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""text"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}

score 3 · Answer 7 · answered Oct 04 '12 at 16:06

Because I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks long, but it works fine.

answered Oct 04 '12 at 16:06

Vaclav Svara

score 3 · Answer 8 · answered Apr 08 '09 at 20:27

Have you tried html2text (http://www.aaronsw.com/2002/html2text/)? It's written in Python, but it's open source.

answered Apr 08 '09 at 20:27

Ian G

score 2 · Answer 9 · edited Apr 09 '09 at 06:17

The easiest would probably be tag stripping combined with replacement of some tags with text layout elements like dashes for list elements (li) and line breaks for br's and p's. It shouldn't be too hard to extend this to tables.
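
A rough version of this idea can be sketched with a few regular expressions. This is only an illustration under simplifying assumptions (no tables, no nested-list indentation, and input that does not hide angle brackets inside attributes or scripts); the class name RoughHtmlToText is made up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class RoughHtmlToText
{
    public static string Convert(string html)
    {
        const RegexOptions opts = RegexOptions.IgnoreCase;

        // Line break for <br>, blank line after each paragraph.
        string text = Regex.Replace(html, @"<br\s*/?>", "\n", opts);
        text = Regex.Replace(text, @"</p\s*>", "\n\n", opts);

        // Dash bullet for list items; the optional group keeps <link> from matching.
        text = Regex.Replace(text, @"<li(\s[^>]*)?>", "\n- ", opts);

        // Strip every remaining tag, then decode entities such as &amp;.
        text = Regex.Replace(text, @"<[^>]+>", "");
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of three or more newlines into a single blank line.
        text = Regex.Replace(text, @"\n{3,}", "\n\n");
        return text.Trim();
    }
}
```

For example, Convert("<p>Hello<br/>world</p><ul><li>a &amp; b</li><li>c</li></ul>") yields "Hello" and "world" on separate lines, a blank line, and then the two items as "- a & b" and "- c".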

edited Apr 09 '09 at 06:17

answered Apr 08 '09 at 20:27

EricSchaefer

Good thinking, it's actually pretty easy to do a rough version. – Ian G Apr 08 '09 at 20:29
Well it depends on the HTML. I wrote a quick and dirty version of this approach in php for a CMS that was sending weekly digest of the post by plain text email. In this case the editor for the posts only allowed certain HTML elements. It should be much harder if full HTML transitional is allowed. – EricSchaefer Apr 08 '09 at 20:33

score 1 · Answer 10 · edited Aug 14 '22 at 14:15

Here is the short sweet answer using HtmlAgilityPack. You can run this in LinqPad.

var html = "<div>..whatever html</div>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var plainText = doc.DocumentNode.InnerText;

I simply use HtmlAgilityPack in any .NET project that needs HTML parsing. It's simple, reliable, and fast.

Update - you are correct that the above removes tags but does not decode the escaped characters. This will do it:

var a = "This &amp; that";
var result = System.Web.HttpUtility.HtmlDecode(a);
result.Dump();

Using the two together you can get the plain text from the HTML.

edited Aug 14 '22 at 14:15

answered May 14 '20 at 17:01

Daniel Williams

It doesn't replace HTML entities like `&gt;` with their textual representation, though. – quant_dev, Aug 11 '22 at 11:47

score 0 · Answer 11 · answered Jan 11 '19 at 14:20

This function converts "what you see in the browser" to plain text with line breaks. (If you want to see the result in the browser, use the commented-out return value instead.)

// Requires System.IO and System.Windows.Forms; the WebBrowser control must be created on an STA thread.
public string HtmlFileToText(string filePath)
{
    using (var browser = new WebBrowser())
    {
        string text = File.ReadAllText(filePath);
        browser.ScriptErrorsSuppressed = true;
        browser.Navigate("about:blank");
        browser.Document?.OpenNew(false);
        browser.Document?.Write(text);
        return browser.Document?.Body?.InnerText;
        //return browser.Document?.Body?.InnerText.Replace(Environment.NewLine, "<br />");
    }
}

score 0 · Answer 12 · edited May 23 '17 at 12:02

Another post suggests the HTML agility pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

score 0 · Answer 13 · answered Apr 08 '09 at 22:20

I have used Detagger in the past. It does a pretty good job of formatting the HTML as text and is more than just a tag remover.

answered Apr 08 '09 at 22:20

Brian Genisio

score -1 · Answer 14 · answered Nov 05 '09 at 21:01

I recently blogged about a solution that worked for me: using a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.

answered Nov 05 '09 at 21:01

ProNotion

score -1 · Answer 15 · answered Jun 16 '15 at 12:11

Try the easy and usable way: just call StripHTML(WebBrowserControl_name);

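// Requires System.Windows.Forms and a reference to Microsoft.mshtml (namespace mshtml).
public string StripHTML(WebBrowser webp)
{
    try
    {
        // Assumption: the original snippet never defined `doc`; here it is taken
        // from the WinForms control's underlying COM document.
        IHTMLDocument2 doc = webp.Document.DomDocument as IHTMLDocument2;
        if (doc == null) return "";
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;

        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception)
    {
        // Errors are swallowed, as in the original; the caller gets an empty string.
    }
    return "";
}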

score -1 · Answer 16 · answered Apr 08 '09 at 20:28

I don't know C#, but there is a fairly small & easy to read python html2txt script here: http://www.aaronsw.com/2002/html2text/

answered Apr 08 '09 at 20:28

jw.

This is closer to what i'm looking for, but this still "flattened" the html tables. :( – Matt Crouch Apr 08 '09 at 20:50

score -2 · Answer 17 · answered Oct 25 '14 at 23:23

If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in a Windows Store app as well.
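
To make the scope of HtmlDecode concrete, here is a minimal sketch (the class and method names are made up for illustration). It shows that the call only turns entities back into characters; it does not remove markup or lay out text.

```csharp
using System;
using System.Net;

public static class DecodeDemo
{
    // Thin wrapper so the behaviour is easy to try out in isolation.
    public static string DecodeOnly(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

DecodeOnly("&lt;b&gt;bold&lt;/b&gt; &amp; more") returns "<b>bold</b> & more": the entities are decoded, but the result is still HTML with its tags intact.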

answered Oct 25 '14 at 23:23

Tyler

That doesn't convert HTML into text, it's for converting HTML Encoded strings into plain HTML (tags). – Rado Sep 03 '15 at 10:05

score -2 · Answer 18 · answered Jun 01 '11 at 14:47

In Genexus you can do it with a regex:

&pattern = '<[^>]+>'

&TSTRPNOT=&TSTRPNOT.ReplaceRegEx(&pattern,"")

In Genexus we can handle it with a regex.

answered Jun 01 '11 at 14:47

user462468

score -3 · Answer 19 · answered Sep 22 '10 at 06:52

You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...

IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;

score -4 · Answer 20 · edited Jan 23 '12 at 16:14

This is another solution to convert HTML to Text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.

edited Jan 23 '12 at 16:14

Abel

answered Sep 29 '09 at 11:23

Maxim Sautin

Max, be clear that this is your product that you are recommending. All of your answers IIRC are you suggesting this product. The SO community is pretty protective and sensitive to spamming/astroturfing. If you are not clear, and if all you do here is suggest people buy your software, you are going to end up doing yourself more harm than good. – Aug 12 '10 at 22:36
Hi Will! Yes, it's my product - you are right, sorry that this post looks like advertising. I'll change it right now to make it without any advertising. – Maxim Sautin Aug 16 '10 at 05:37
