
I am making a small web analysis tool and need to extract all the text blocks on a given URL that contain more than X words.

The method I currently use is this:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();

            // Collect the text of every leaf node in the document.
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();
        }
        catch (Exception)
        {
            // Parse errors are swallowed; whatever was collected so far is returned.
        }

        // Decode HTML entities (&amp;, &nbsp;, ...) into plain characters.
        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

The problem here is that I get all the text returned, even if it's a menu item, a footer with three words, etc.

I want to analyse the actual content on a page, so my idea is to only parse the text that could be content (i.e. text blocks with more than X words).

Any ideas how this could be achieved?

Jacqueline

1 Answer


Well, a first approach can be a simple word-count analysis of each node.InnerText value using the string.Split function:

    // Passing a null separator makes Split use whitespace characters as delimiters.
    string[] words = text.Split((string[])null, StringSplitOptions.RemoveEmptyEntries);

and append only the text where words.Length is greater than your X threshold, as in the sketch below.
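
A minimal sketch of how that filter could slot into the question's loop (the method name getContentText and the minWords parameter are hypothetical; the same usings as the original are assumed: System, System.Text, System.Web):

    // Hypothetical variant of getAllText: same leaf-node walk, but only
    // text blocks containing at least minWords words are kept.
    public string getContentText(string _html, int minWords)
    {
        var document = new HtmlAgilityPack.HtmlDocument();
        document.LoadHtml(_html);

        var sb = new StringBuilder();
        foreach (var node in document.DocumentNode.DescendantNodesAndSelf())
        {
            if (node.HasChildNodes)
                continue;

            string text = System.Web.HttpUtility.HtmlDecode(node.InnerText).Trim();
            if (string.IsNullOrEmpty(text))
                continue;

            // Passing a null separator splits on any whitespace characters.
            string[] words = text.Split((string[])null, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length >= minWords)
                sb.AppendLine(text);
        }

        return sb.ToString();
    }

Calling getContentText(html, 10), for example, should drop menu items and short footer lines while keeping paragraph-sized blocks.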

Also see this question's answer for some more tricks on raw text gathering.

Petr Abdulin