Get plain text from HTML in .NET

Question

What is the best way to get a plain text string from an HTML string?

public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}

Thanks in advance

@slandau: I want to output readable text from an HTML input. I'm not sure if something additional to remove the tags... — Daniel Peñalba, May 03 '11 at 13:52

score 46 · Answer 1 · answered May 03 '11 at 14:59

You can use MSHTML, which is pretty forgiving of malformed markup:

// Requires a reference to Microsoft.mshtml.dll
// using mshtml;
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? &amp; who?" });

string txt = htmldoc2.body.outerText;

Plateau of Leng 2 sugars please what? & who?

answered May 03 '11 at 14:59

Alex K.

Works like a charm! Should be the accepted answer. Note that you need to add a reference to `Microsoft.mshtml.dll` first. – Sinan ILYAS Jul 05 '16 at 14:09
Are you sure this method is safe with HTML from untrusted sources? Does HTMLDocument.Write() execute passed scripts? – gilad905 Nov 16 '16 at 17:20
This answer is far more robust than the accepted answer (that uses just simple regex to remove tags) and is probably necessary for pages with any reasonable complexity. – Special Sauce Nov 16 '16 at 22:00
@giladmayani You could use the accepted answer here http://stackoverflow.com/a/19414886/1911540 to strip out any `<script>` […] – Special Sauce Nov 16 '16 at 22:02
@SpecialSauce that is true, but don't forget that technically, Javascript can exist not only in `<script>` […] – gilad905 Nov 17 '16 at 08:14
This is an under-appreciated, simple solution, particularly for generating the plain text content from an html email. – Daniel Aug 01 '21 at 10:31

Rudi Visser · Accepted Answer · 2011-05-03T14:02:26.390

25

There's no built-in utility as far as I know, but depending on your requirements you could use regular expressions to strip out all of the tags:

// using System.Text.RegularExpressions;
string htmlString = @"<p>I'm HTML!</p>";
string plainText = Regex.Replace(htmlString, @"<(.|\n)*?>", ""); // "I'm HTML!"

edited May 03 '11 at 14:02

answered May 03 '11 at 13:48

Rudi Visser

check this epic question http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Andrey May 03 '11 at 13:57
@Andrey Haha that's a pretty awesome accepted answer. Luckily the OP didn't state exact requirements nor define the HTML string so this should catch most actual HTML cases, rather than XHTML. – Rudi Visser May 03 '11 at 14:00
Regex still doesn't necessarily yield the final result. You need to convert *at least* `&lt;`, `&gt;` and `&amp;`. If your text contains other HTML character entities like `&scaron;` (š) you need to decode all of them as well. – miroxlav Oct 15 '14 at 11:41

score 5 · Answer 3 · edited Nov 28 '17 at 02:42

~~There is no built-in solution in the framework.~~

If you need to parse HTML, I have had good experience with a library called HTML Agility Pack.
It parses an HTML file and exposes it as a DOM, similar to the XML classes.
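
As a rough illustration of the DOM approach described above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HtmlAgilityPackText` is made up for this example, and the method name follows the question's `GetPlainText`.

```csharp
using HtmlAgilityPack; // third-party NuGet package, not part of the framework

public static class HtmlAgilityPackText // hypothetical helper class
{
    public static string GetPlainText(string htmlString)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlString ?? string.Empty);

        // InnerText concatenates the text of all descendant nodes.
        string text = doc.DocumentNode.InnerText;

        // Decode character entities such as &amp; into plain characters.
        return HtmlEntity.DeEntitize(text);
    }
}
```

Be aware that `InnerText` may also include the contents of script and style elements, so depending on the input you might want to remove those nodes from the document first.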

edited Nov 28 '17 at 02:42

wp78de

answered May 03 '11 at 13:59

Alex

score 1 · Answer 4 · answered Aug 17 '15 at 15:37

Personally, I found a combination of regex and HttpUtility to be the best and shortest solution.

Return HttpUtility.HtmlDecode(
                Regex.Replace(HtmlString, "<(.|\n)*?>", "")
                )

This removes all the tags, and then decodes any leftover entities like `&lt;` or `&gt;`.
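
For readers working in C#, the same two steps could be sketched as below, filling in the `GetPlainText` method from the question. This is an assumption-laden sketch rather than this answer's exact code: it swaps in `System.Net.WebUtility.HtmlDecode` for `HttpUtility.HtmlDecode` so that no reference to System.Web is needed, and the class name `HtmlText` is made up for the example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText // hypothetical helper class
{
    // Matches anything that looks like a tag, including tags that span lines.
    private static readonly Regex TagPattern = new Regex(@"<(.|\n)*?>");

    public static string GetPlainText(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return string.Empty;
        }

        // Step 1: strip the tags.
        string withoutTags = TagPattern.Replace(htmlString, "");

        // Step 2: decode entities such as &lt; and &amp;.
        // WebUtility is used here instead of HttpUtility (an assumption of this sketch).
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

Calling `HtmlText.GetPlainText("<p>2 &lt; 3 &amp; 4 &gt; 1</p>")` returns `2 < 3 & 4 > 1`. Decoding after stripping matters: done the other way round, an encoded `&lt;b&gt;` in the text would turn into `<b>` and be removed as if it were a tag.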

score 0 · Answer 5 · answered May 03 '11 at 13:53

There isn't a built-in .NET method for this. But, as @rudi_visser pointed out, it can be done with regular expressions.

If you need to do more than just remove the tags (e.g., turn `&acirc;` into â), you can use a more elaborate solution, like the one found here.

answered May 03 '11 at 13:53

Erick Petrucelli
