Take a look at the HTML Agility Pack. It's an HTML parser that you can use to extract the InnerText from HTML nodes in a document.
As has been pointed out many times here on SO, you can't trust HTML parsing to a regular expression. There are times when it might be considered appropriate (for extremely limited tasks), but in general, HTML is too complex and too prone to irregularity. A pattern that works on your test input can silently miss or mangle content as soon as the markup contains nested tags, comments, unclosed elements, or a stray angle bracket inside an attribute value.
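To make that concrete, here is a minimal sketch of one common failure. The `RegexTagStripDemo` class and its `StripTags` helper are made up for illustration; the pattern is the usual "delete anything that looks like a tag" regex, and the sample markup is an assumption of mine, not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTagStripDemo
{
    // Naive approach (hypothetical helper): delete anything that looks like a tag.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]+>", string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a quoted attribute value is legal HTML,
        // but it makes the pattern end its match too early.
        string html = "<a title=\"a > b\">link</a>";

        // Part of the attribute leaks into the result:
        // prints ' b">link' (with a leading space) instead of 'link'.
        Console.WriteLine(StripTags(html));
    }
}
```

You can keep patching the pattern for each case like this, but at that point you are slowly writing a parser of your own.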
Using a parser such as HAP gives you much more flexibility. A (rough) example of what it might look like to use it for this task:
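// Requires the HtmlAgilityPack package, plus "using System.Text;" for StringBuilder.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.Load("path to your HTML document");
StringBuilder content = new StringBuilder();
// Visit every node, but only keep the text of leaf nodes,
// so the same text is not appended again for each ancestor.
foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        content.AppendLine(node.InnerText);
    }
}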
You can also run XPath queries against the document if you're only interested in a specific node or set of nodes:
var nodes = doc.DocumentNode.SelectNodes("your XPATH query here");
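A slightly fuller sketch of the XPath route, in case it helps. The `XPathDemo` class, the `SelectText` helper, the sample markup, and the `//p` query are all made up for illustration. It uses `LoadHtml` to parse a string rather than a file, and it guards against `SelectNodes` returning null, which is what HAP does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Hypothetical helper: returns the text of every node matched by the XPath query.
    public static List<string> SelectText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse markup from a string instead of a file

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; in case InnerText returns them undecoded.
            results.Add(HtmlEntity.DeEntitize(node.InnerText));
        }
        return results;
    }

    public static void Main()
    {
        // Sample markup, assumed for this example only.
        string html = "<html><body><p>First &amp; foremost</p><div><p>Second</p></div></body></html>";

        // "//p" selects every <p> element anywhere in the document.
        foreach (string text in SelectText(html, "//p"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Forgetting the null check is an easy way to get a NullReferenceException the first time a page doesn't contain the element you were expecting.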
Hope this helps.