I need to get the total number of words on a web page. This method returns 336, but when I check manually with wordcounter.net, the page has about 1192 words. How can I get just the word count of the article?

int kelimeSayisi()
{
    Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    string html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    var kelime = doc.DocumentNode.SelectNodes("//text()").Count;
    return kelime;
}
maliyassi
  • That code gets the number of text nodes, not the number of words in those text nodes. Iterate over the text nodes, get their value, and use [Counting number of words in C#](https://stackoverflow.com/q/2257993/215552). – Heretic Monkey Mar 30 '20 at 12:41
  • @HereticMonkey I'm a newbie at web scraping. Could you please help me more with this problem? What do you mean by "_Iterate over the text nodes, get their value._"? – maliyassi Mar 30 '20 at 13:04
  • Well, that's just plain old C#, but something like `foreach (string text in doc.DocumentNode.SelectNodes("//text()").Select(node => node.InnerText)) { /* do something with text */ }` – Heretic Monkey Mar 30 '20 at 13:13
  • @HereticMonkey Thank you so much. – maliyassi Mar 30 '20 at 13:44

1 Answer

As HereticMonkey mentioned in a comment, you're only retrieving the total number of text nodes, so you need to count the words inside each node's InnerText. There are also a couple of other things you'll most likely want to do:

  • Only look in the body of the page
  • Exclude script nodes so you don't return JavaScript

I've written a modified version of your code that does that. It counts the words by splitting on the space character and treating only strings that start with a letter as words:

// Requires: using System; using System.Linq; using System.Net;
// plus the HtmlAgilityPack NuGet package.
int kelimeSayisi()
{
    Uri url = new Uri("https://www.fitekran.com/hamilelik-ve-spor-hamileyken-hangi-spor-nasil-yapilir/");
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    string html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    char[] delimiter = new char[] {' '};
    int kelime = 0; // running total of words
    // Text nodes inside <body>, skipping those whose direct parent is <script>
    foreach (string text in doc.DocumentNode
        .SelectNodes("//body//text()[not(parent::script)]")
        .Select(node => node.InnerText))
    {
        // Split on spaces and keep only tokens that start with a letter
        var words = text.Split(delimiter, StringSplitOptions.RemoveEmptyEntries)
            .Where(s => Char.IsLetter(s[0]));
        int wordCount = words.Count();
        if (wordCount > 0)
        {
            // Print what was counted so the result can be reviewed
            Console.WriteLine(String.Join(" ", words));
            kelime += wordCount;
        }
    }
    return kelime;
}

That returns a total word count of 1487 and also writes to the console everything that's being treated as a word so you can review what's being included. It may be that wordcounter.net is excluding a few things like headers and footers.
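
If you want to get closer to an article-only count, one option is to restrict the search to the element that wraps the article instead of the whole body. The following is only a rough sketch under some assumptions: `WordCounter`, `CountWords` and the `containerXPath` parameter are names I made up for illustration, and the right XPath for this particular page is something you would have to find by inspecting its markup (I haven't checked it). It also splits on any whitespace rather than just spaces, decodes HTML entities, and skips style nodes as well as script nodes, so its total will not match the 1487 above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class WordCounter
{
    // Counts words in the text nodes under the first node matching containerXPath.
    // containerXPath is an assumption: inspect the page to find the element
    // that wraps the article (for example "//article").
    public static int CountWords(string html, string containerXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode container = doc.DocumentNode.SelectSingleNode(containerXPath);
        if (container == null)
            return 0; // no matching container on this page

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = container.SelectNodes(
            ".//text()[not(ancestor::script) and not(ancestor::style)]");
        if (textNodes == null)
            return 0;

        return textNodes
            // Decode entities such as &amp; before splitting
            .Select(node => HtmlEntity.DeEntitize(node.InnerText))
            // A null separator array makes Split break on any whitespace
            .SelectMany(text => text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            // Same rule as above: only tokens starting with a letter count as words
            .Count(word => char.IsLetter(word[0]));
    }
}
```

You would call it with the downloaded HTML, e.g. `WordCounter.CountWords(html, "//article")`, and adjust the XPath until the words it picks up match what you consider the article.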

PeterJ