I'm creating a small algorithm to fetch text from websites and then find answers in it (I will post the script once it is completed).

To do that, I need to convert all the HTML between `<body>` and `</body>` into plain, readable English text.

I've manually removed all HTML tags, but some CSS entries are hard to get rid of. Any simple ideas on how to convert HTML to plain English text?

Thanks.

Brian Rasmussen
Arjun
  • To do that, I need to convert all HTML code within body and /body into plain readable english text. (body was removed from question) – Arjun May 11 '09 at 06:18
  • if you remove the tags, there should be any CSS entries left. Maybe you can post some sample that is hard to get rid of? – Francis May 11 '09 at 06:22
  • I mean there should "not" be any CSS in previous comment... – Francis May 11 '09 at 06:22

2 Answers

Someone has already done all the work for you.

balexandre
  • Wouldn't parsing to a DOM and using InnerText be better? – okutane May 11 '09 at 06:51
  • yes, if it's a valid HTML... we never get a valid one, for example, parsing the DOM as XML will throw an error with `
    ` but not `
    `. If you are 100% sure you will have the correct HTML, good.
    – balexandre Apr 04 '13 at 06:27

I developed something similar that avoids the performance penalty of Regex: a strip_tags equivalent for ASP.NET (it can be run in desktop .NET assemblies too).

Andrei Rînea
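
For readers who want to see the idea discussed above in code, here is a minimal sketch of tag stripping without regular expressions. It is not the code linked in either answer, and it is written in Python rather than .NET purely for illustration. It relies on the standard library's tolerant `html.parser`, so unclosed tags do not raise errors, and it drops the contents of `<script>` and `<style>` elements, which is probably where the leftover CSS in the question comes from (an assumption, since no sample was posted). The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks do not run together,
        # then collapse every run of whitespace into a single space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Calling `html_to_text(page_source)` returns a single whitespace-normalized string. One trade-off of joining chunks with a space is that text split by inline tags can pick up an extra space before punctuation; a fuller version would probably treat block and inline elements differently.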