4

I'm generating emails in one of my solutions and need to produce both HTML and plain-text versions from a given HTML page.

However, I haven't found a really good way to strip HTML, JS and CSS from whatever HTML template the customers might provide.

Is there a simple solution to this, perhaps a component that handles all of it, or do I need to start puzzling with regexps? And is it even possible to write a bulletproof regexp for all possible tags?

Regards

Brian Rasmussen
elwis
  • Similar question: http://stackoverflow.com/questions/1393982/strip-everything-but-text-from-html – HABJAN Mar 31 '11 at 07:52

5 Answers

8

Give HtmlAgilityPack a go. It has methods for extracting the text out of an HTML Document.

You basically just need to do the following:

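  // Requires the HtmlAgilityPack package and "using HtmlAgilityPack;".
  var doc = new HtmlDocument();
  doc.LoadHtml(htmlStr);  // htmlStr holds the HTML markup as a string
  var node = doc.DocumentNode;
  // InnerText concatenates the text of all descendant nodes.
  var textContent = node.InnerText;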
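
Since the question also mentions JS and CSS: as far as I know, InnerText on the document node also returns the text inside script and style elements, so those nodes are usually removed first. A minimal sketch of that, assuming the HtmlAgilityPack package is referenced; the class and method names (HtmlToText, Convert) are made up for illustration and are not part of the library:

```csharp
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the text
    // of an HTML string, skipping the contents of <script> and <style>.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against it.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            foreach (var node in unwanted)
            {
                node.Remove();
            }
        }

        // Decode entities such as &amp; into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

The result keeps whatever whitespace and line breaks the markup contained, so for a plain-text mail you will probably still want to collapse blank lines afterwards.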
carla
paracycle
1

A component that can strip HTML: Html Agility Pack

wassertim
1

You might find the Html Agility Pack helpful in your situation.

carla
tdaines
1

Take a look here: HTMLAgilityPack parse in the InnerHTML. One of the answers there shows how to do it using Html Agility Pack.

Community
Rafal Spacjer
0

On this page you can find a really fast algorithm for stripping HTML tags from a string input. Although it has some issues with invalid HTML, it's still a great resource: http://www.dotnetperls.com/remove-html-tags
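
To give an idea of what such a character-scanning approach looks like, here is a rough sketch written for this answer rather than copied from the linked page. The names TagStripper and StripTags are hypothetical. It copies every character that sits outside a pair of angle brackets:

```csharp
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: keeps only the characters outside <...>.
    public static string StripTags(string source)
    {
        var result = new StringBuilder(source.Length);
        bool insideTag = false;

        foreach (char c in source)
        {
            if (c == '<')
            {
                insideTag = true;   // start of a tag: stop copying
            }
            else if (c == '>')
            {
                insideTag = false;  // end of a tag: resume copying
            }
            else if (!insideTag)
            {
                result.Append(c);
            }
        }

        return result.ToString();
    }
}
```

For example, StripTags("<p>Hello <b>world</b></p>") returns "Hello world". Note the limits: a literal < in the text (as in "a < b") swallows everything up to the next >, and the bodies of script and style elements are kept as plain text, so this only suits input you can trust to be simple.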

Mj. Logan