
I am developing a web scraper that pulls article titles from a webpage (http://www.espn.com/college-football/). It works well, but it only pulls some of the articles, not all of them.

The articles are inside a <section data-everscroll="true">, so I suspect that is where it stops.

My question is: how can I gather all the articles on the page, all the way to the bottom? There are 119 articles in total.

Disclaimer: I have contacted ESPN and gotten their permission to scrape and use their articles for this project.

using System;
using HtmlAgilityPack;

static void Main(string[] args)
{
    var getHtmlWeb = new HtmlWeb();
    var doc = getHtmlWeb.Load("http://www.espn.com/college-football");

    // SelectNodes returns null (not an empty collection) when the XPath matches nothing.
    var titles = doc.DocumentNode.SelectNodes("//*[@id=\"news-feed\"]/article//section//h1");
    if (titles == null)
    {
        Console.WriteLine("No titles found.");
        return;
    }

    foreach (var title in titles)
    {
        Console.WriteLine($"Title: {title.InnerText}");
    }

    Console.ReadLine();
}
PoLáKoSz
Rodney Wilson
  • Looks like they are using everscroll: https://github.com/alexblack/infinite-scroll, meaning they are loading subsequent articles dynamically via javascript, which I don't believe HTML Agility Pack will handle. Probably going to need to try to mimic their API calls if you really want to do this. – Michael Weinand Jul 20 '17 at 21:44
  • Look at Selenium. https://stackoverflow.com/questions/18572651/selenium-scroll-down-a-growing-page – TyCobb Jul 20 '17 at 21:46
  • 1
    Sorry but I highly doubt someone in charge said oh sure waste our bandwidth and parse our site you unknown developer. Any reason you aren't using their RSS feeds? http://www.espn.com/espn/news/story?page=rssinfo – Rand Random Jul 20 '17 at 21:47
  • @TyCobb Thank you! I used Selenium and was able to grab all the article titles. If you move your comment to an answer, I'll mark it. – Rodney Wilson Jul 20 '17 at 22:34
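
The Selenium approach suggested in the comments could look roughly like the sketch below: scroll to the bottom repeatedly until the page height stops growing, then run the same XPath against the fully loaded DOM. This is illustrative only — it assumes the Selenium.WebDriver and ChromeDriver NuGet packages, and the fixed two-second wait and the XPath are untested assumptions about the live page.

```csharp
using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Scraper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.espn.com/college-football/");
            var js = (IJavaScriptExecutor)driver;
            long lastHeight = 0;

            // Keep scrolling until the document height stops growing,
            // i.e. the infinite scroll has loaded everything it is going to load.
            while (true)
            {
                js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
                Thread.Sleep(2000); // crude wait for the next batch of articles (assumption)
                long newHeight = (long)js.ExecuteScript("return document.body.scrollHeight;");
                if (newHeight == lastHeight) break;
                lastHeight = newHeight;
            }

            // Same XPath as the HtmlAgilityPack version, now against the rendered page.
            var titles = driver.FindElements(By.XPath("//*[@id='news-feed']/article//section//h1"));
            foreach (var title in titles)
            {
                Console.WriteLine($"Title: {title.Text}");
            }
        }
    }
}
```

A WebDriverWait polling for new articles would be more robust than the fixed Thread.Sleep, but the sleep keeps the sketch short.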

0 Answers