.NET code to convert HTML to text

Question

I'm creating a small algorithm to fetch text from websites and then find answers (I'll post the script once it's completed).

To do that, I need to convert all HTML code between `<body>` and `</body>` into plain, readable English text.

I've manually removed all HTML tags, but some CSS entries are hard to get rid of. Any simple ideas on how to convert HTML to plain English text?

Thanks.

To do that, I need to convert all HTML code within `<body>` and `</body>` into plain readable English text. (The body tags were removed from the question.) – Arjun, May 11 '09 at 06:18
If you remove the tags, there shouldn't be any CSS entries left. Maybe you can post a sample that is hard to get rid of? – Francis, May 11 '09 at 06:22

score 5 · Accepted Answer · answered May 11 '09 at 06:30

Someone has already done all the work for you.
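
For illustration, here is a minimal regex-based sketch of one common approach (an assumption, not necessarily what the linked code does). It drops `<script>` and `<style>` elements together with their contents, which is usually where leftover CSS text comes from, then strips the remaining tags and decodes entities. The names `HtmlToText` and `Convert` are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToText
{
    // Hypothetical helper: a regex sketch, not a full HTML parser.
    public static string Convert(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // Remove <script> and <style> elements including their contents.
        // Stripping only the tags would leave the CSS/JS text behind.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove HTML comments.
        text = Regex.Replace(text, @"<!--.*?-->", " ", RegexOptions.Singleline);

        // Replace every remaining tag with a space so adjacent words stay apart.
        // Trade-off: a tag in the middle of a word also becomes a space.
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and &lt;.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, `HtmlToText.Convert("<style>p { color: red; }</style><p>Hello &amp; <b>welcome</b></p>")` returns `Hello & welcome`. Regexes like these will misfire on badly broken markup, so treat this as a starting point only.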

balexandre

Wouldn't parsing to a DOM and using InnerText be better? – okutane May 11 '09 at 06:51
Yes, if it's valid HTML, but we never get valid HTML. For example, parsing the DOM as XML will throw an error with `<br>` but not with `<br />`. If you are 100% sure you will have correct HTML, good. – balexandre Apr 04 '13 at 06:27
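
The point about strict parsing can be illustrated with `System.Xml`. This is a small sketch; the sample markup (an unclosed `<br>` versus a self-closed `<br />`) is an assumed example, and `XmlParseDemo` and `TryInnerText` are hypothetical names.

```csharp
using System;
using System.Xml;

public static class XmlParseDemo
{
    // Returns the concatenated text if the markup is well-formed XML,
    // or null if the XML parser rejects it.
    public static string TryInnerText(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return doc.DocumentElement.InnerText;
        }
        catch (XmlException)
        {
            // Typical real-world HTML (unclosed tags, HTML-only entities)
            // ends up here.
            return null;
        }
    }
}
```

`XmlParseDemo.TryInnerText("<p>one<br />two</p>")` returns `onetwo`, while `XmlParseDemo.TryInnerText("<p>one<br>two</p>")` returns `null` because the unclosed `<br>` is not well-formed XML.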

score 0 · Answer 2 · answered May 11 '09 at 22:28

I developed something similar that avoids regex's performance penalty: a strip_tags equivalent for ASP.NET (it can be run from desktop .NET assemblies too).
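
As a rough sketch of the regex-free idea (not the author's actual code; `TagStripper` and `StripTags` are hypothetical names), a single pass over the characters can skip everything between `<` and `>`:

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: one pass over the input, no regex involved.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var sb = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;       // start skipping
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;      // tag closed, resume copying
            }
            else if (!insideTag)
            {
                sb.Append(c);           // ordinary text
            }
        }
        return sb.ToString();
    }
}
```

`TagStripper.StripTags("<p>Hello <b>world</b></p>")` returns `Hello world`. This sketch treats every `<` as the start of a tag, so a literal `<` in the text swallows what follows it, and the contents of `<style>` and `<script>` elements are kept as text.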

Andrei Rînea
