I am parsing HTML (held as a string in C# code) and need to extract every piece of text from it. For example, given this HTML:

<div><div>text1</div>text2</div>

I want to get this array of strings:

text1
text2

If a regular expression cannot do this, please describe an algorithm that skips all tag names and tag attributes and returns only the text content.

Update: this is not a duplicate of the span question, because the text can be in any tag, not only span. I need all the text, excluding tags and attributes. I don't want to use the HtmlAgilityPack parser.

Update 2: I found a regex (yes, it is possible):

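    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    // Parse the HTML and save each text node in the list.
    public void FindTextHtml(string html, List<string> list)
    {
        // Capture whatever sits between a closing '>' and the next opening '<'.
        var ms = Regex.Matches(html, @">([^<>]*)<");
        foreach (Match m in ms)
        {
            var text = m.Groups[1].Value;
            // Adjacent tags such as "<div><div>" produce empty matches; skip them.
            if (!string.IsNullOrWhiteSpace(text))
                list.Add(text);
        }
    }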

Full source code available here
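
For the non-regex route, here is a minimal sketch of a character scanner that skips tag names and attributes and keeps only the text in between. The class and method names (`HtmlTextScanner`, `ExtractText`) are made up for illustration. It assumes reasonably well-formed markup and does not handle comments, `script`/`style` content, CDATA, or entity decoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlTextScanner
{
    // Hypothetical helper: returns the trimmed, non-empty text runs found between tags.
    public static List<string> ExtractText(string html)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool insideTag = false;
        char quote = '\0'; // active quote character while inside an attribute value

        foreach (char c in html)
        {
            if (insideTag)
            {
                if (quote != '\0')
                {
                    // Inside a quoted attribute value: only the matching quote ends it.
                    if (c == quote) quote = '\0';
                }
                else if (c == '"' || c == '\'')
                {
                    quote = c;
                }
                else if (c == '>')
                {
                    insideTag = false;
                }
            }
            else if (c == '<')
            {
                // A tag starts: store the text collected so far.
                Flush(current, result);
                insideTag = true;
            }
            else
            {
                current.Append(c);
            }
        }

        Flush(current, result); // text after the last tag, if any
        return result;
    }

    private static void Flush(StringBuilder current, List<string> result)
    {
        string text = current.ToString().Trim();
        if (text.Length > 0) result.Add(text);
        current.Clear();
    }
}
```

Calling `HtmlTextScanner.ExtractText("<div><div>text1</div>text2</div>")` returns a list containing `text1` and `text2`. Unlike the regex, the scanner tracks quotes, so a `>` inside an attribute value such as `title="a > b"` does not end the tag early.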

Alexey Obukhov

2 Answers

What you are looking for is here: Grabbing HTML Tags

The matches you are looking for would be in the ...(.*?)... group. Hope this helps

Kasper Jensen

Use the HtmlAgilityPack DLL, which parses XML and HTML files, and then use the code below to get your text:

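        string path = @"path to the file";
        HtmlAgilityPack.HtmlDocument hd = new HtmlAgilityPack.HtmlDocument();
        // Load and parse the file from disk.
        hd.Load(path);
        // InnerText returns the text of all nodes concatenated into one string.
        string result = hd.DocumentNode.InnerText.Trim();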

That is all you need.
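
Since `InnerText` on the document node gives one combined string, a variation may be useful when each phrase is wanted separately. The following is a sketch only, assuming the HtmlAgilityPack package is referenced and the HTML is already held in a string. The class and method names (`HtmlTextNodes`, `GetTextNodes`) are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class HtmlTextNodes
{
    // Hypothetical helper: collects each non-empty text node as its own list entry.
    public static List<string> GetTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file path

        var result = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Keep text nodes only; element and comment nodes are skipped.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Decode entities such as &amp; and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) result.Add(text);
        }
        return result;
    }
}
```

Note that text inside `script` and `style` elements is also stored as text nodes, so those may need to be filtered out by checking the parent node's name.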

ako