31

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}

Thanks in advance

Daniel Peñalba

5 Answers

46

You can use MSHTML, which is pretty forgiving of malformed markup:

// Requires a reference to Microsoft.mshtml.dll (namespace: mshtml)
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

string txt = htmldoc2.body.outerText;

Plateau of Leng 2 sugars please what? & who?

Alex K.
  • Works like a charm! Should be the accepted answer. Note that you need to add a reference to `Microsoft.mshtml.dll` first. – Sinan ILYAS Jul 05 '16 at 14:09
  • Are you sure this method is safe with HTML from untrusted sources? Does HTMLDocument.Write() execute passed scripts? – gilad905 Nov 16 '16 at 17:20
  • This answer is far more robust than the accepted answer (that uses just simple regex to remove tags) and is probably necessary for pages with any reasonable complexity. – Special Sauce Nov 16 '16 at 22:00
  • @giladmayani You could use the accepted answer here http://stackoverflow.com/a/19414886/1911540 to strip out any ` – Special Sauce Nov 16 '16 at 22:02
  • @SpecialSauce that is true, but don't forget that technically, Javascript can exist not only in ` – gilad905 Nov 17 '16 at 08:14
  • This is an under-appreciated, simple solution, particularly for generating the plain text content from an html email. – Daniel Aug 01 '21 at 10:31
25

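There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

// using System.Text.RegularExpressions;
string htmlString = @"<p>I'm HTML!</p>";
// Regex.Replace returns a new string; the input is left unchanged.
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", "");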
Rudi Visser
  • Check this epic question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Andrey May 03 '11 at 13:57
  • @Andrey Haha that's a pretty awesome accepted answer. Luckily the OP didn't state exact requirements nor define the HTML string so this should catch most actual HTML cases, rather than XHTML. – Rudi Visser May 03 '11 at 14:00
  • Regex still doesn't necessarily yield the final result. You need to convert *at least* `&lt;`, `&gt;` and `&amp;`. If your text contains other HTML character entities like `&scaron;` (š), you need to decode all of them as well. – miroxlav Oct 15 '14 at 11:41
5

There is no built-in solution in the framework.

If you need to parse HTML, I have had good experience with a library called HTML Agility Pack.
It parses an HTML file and exposes it as a DOM, similar to the XML classes.
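
As a rough illustration of the approach described above, text extraction with HTML Agility Pack might look like the sketch below. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `AgilityPackSketch` is hypothetical and only there for the example. Note that this simple version would also include the text content of `<script>` and `<style>` elements, so it may need filtering for real pages.

```csharp
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackSketch // hypothetical name, for illustration only
{
    public static string GetPlainText(string htmlString)
    {
        // Parse the markup into a DOM; the parser tolerates malformed HTML.
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // InnerText concatenates the text nodes but leaves entities encoded,
        // so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Calling `AgilityPackSketch.GetPlainText(html)` then returns the concatenated text of the document with entities decoded.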

wp78de
Alex
1

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

Return HttpUtility.HtmlDecode(
                Regex.Replace(HtmlString, "<(.|\n)*?>", "")
                )

This removes all the tags and then decodes any remaining entities such as &lt; or &gt;.
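
For readers working in C#, the same regex-plus-decode idea could be sketched as follows, filling in the `GetPlainText` signature from the question. This is a minimal sketch, not a robust HTML parser: it uses `System.Net.WebUtility.HtmlDecode` instead of `HttpUtility` so that no reference to `System.Web` is needed, and the class name `PlainTextSketch` is hypothetical. As the comments above point out, a tag-stripping regex does not remove script or style contents and can be confused by a `>` inside an attribute value.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSketch // hypothetical name, for illustration only
{
    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return string.Empty;
        }

        // Step 1: remove anything that looks like a tag (same pattern as above).
        string withoutTags = Regex.Replace(htmlString, @"<(.|\n)*?>", "");

        // Step 2: decode character entities such as &lt;, &gt; and &amp;.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Stripping tags before decoding matters here: decoding first would turn `&lt;` into `<`, which the regex could then mistake for the start of a tag.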

0

There isn't a built-in .NET method to do it. But, as pointed out by @rudi_visser, it can be done with regular expressions.

If you need to do more than just remove the tags (e.g., turn &acirc; into â), you can use a more elaborate solution, like the one found here.

Erick Petrucelli