I am parsing HTML (held as a string in C# code) and need to extract all the text phrases from it. For example, given this HTML:
<div><div>text1</div>text2</div>
I want to get this array of strings:
text1
text2
If this is impossible with a regular expression, please describe an algorithm that skips all tag names and tag attributes and returns only the text content.
Update: this is not a duplicate of the span question, because the text can be in any tag, not only span. I need all the text, excluding tags and attributes. I don't want to use the HtmlAgilityPack parser.
Update 2: I found a regex (yes, it is possible):
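    // Parse the HTML and save each text node in the list.
    public void FindTextHtml(string html, List<string> list)
    {
        // Capture every run of characters between a '>' and the next '<'.
        var ms = Regex.Matches(html, @">([^<>]*)<", RegexOptions.IgnoreCase | RegexOptions.Multiline);
        foreach (Match m in ms)
        {
            // Group 1 holds the text between the two tags.
            var text = m.Groups[1].Value;
            list.Add(text);
        }
    }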
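One thing I noticed while testing: on the sample input the pattern also matches the empty gap between the two adjacent opening tags, so the list gets an empty string before text1. Below is a minimal sketch of a variant that trims each capture and skips the empty ones. The class name `HtmlTextSketch` and the method name `ExtractText` are made up for this illustration. It still assumes that all text sits between two tags, and that the markup contains no `<script>`/`<style>` blocks, comments, or `>` characters inside attribute values; I have not tried to handle those cases.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Hypothetical helper (not part of the original code): returns the
    // non-empty text runs found between a '>' and the next '<'.
    public static List<string> ExtractText(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, @">([^<>]*)<"))
        {
            // Trim surrounding whitespace and drop captures that are empty,
            // e.g. the gap between two adjacent tags such as "<div><div>".
            var text = m.Groups[1].Value.Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }

    public static void Main()
    {
        var phrases = ExtractText("<div><div>text1</div>text2</div>");
        Console.WriteLine(string.Join(",", phrases)); // prints "text1,text2"
    }
}
```

Text that comes before the first tag or after the last tag is not captured by this pattern, so a fragment without a wrapping element would need to be wrapped first or handled separately.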
Full source code available here