
I am developing a web scraper that pulls article titles from a webpage (http://www.espn.com/college-football/). It works well, but it only pulls some of the articles, not all of them.

The articles are inside a <section data-everscroll="true">, so I suspect that is where it stops.

My question is: how can I gather all the articles on the page, all the way to the bottom? There are 119 articles in total.

Disclaimer: I have contacted ESPN and gotten their permission to scrape and use their articles for this project.

using System;
using HtmlAgilityPack;

static void Main(string[] args)
{
    var getHtmlWeb = new HtmlWeb();
    var doc = getHtmlWeb.Load("http://www.espn.com/college-football");

    // SelectNodes returns null (not an empty collection) when the XPath matches nothing.
    var titles = doc.DocumentNode.SelectNodes("//*[@id=\"news-feed\"]/article//section//h1");
    if (titles == null)
    {
        Console.WriteLine("No titles found.");
        return;
    }

    foreach (var title in titles)
    {
        Console.WriteLine($"Title: {title.InnerText}");
    }

    Console.ReadLine();
}
PoLáKoSz
Rodney Wilson
  • Looks like they are using everscroll: https://github.com/alexblack/infinite-scroll, meaning they are loading subsequent articles dynamically via javascript, which I don't believe HTML Agility Pack will handle. Probably going to need to try to mimic their API calls if you really want to do this. – Michael Weinand Jul 20 '17 at 21:44
  • Look at Selenium. https://stackoverflow.com/questions/18572651/selenium-scroll-down-a-growing-page – TyCobb Jul 20 '17 at 21:46
  • 1
    Sorry but I highly doubt someone in charge said oh sure waste our bandwidth and parse our site you unknown developer. Any reason you aren't using their RSS feeds? http://www.espn.com/espn/news/story?page=rssinfo – Rand Random Jul 20 '17 at 21:47
  • @TyCobb Thank you! I used Selenium and was able to grab all the article titles. If you move your comment to an answer, I'll mark it. – Rodney Wilson Jul 20 '17 at 22:34
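
The Selenium approach suggested in the comments could look roughly like the sketch below: scroll to the bottom repeatedly until the page height stops growing, then run the same XPath against the fully loaded DOM. This is illustrative only — it assumes the Selenium.WebDriver and ChromeDriver NuGet packages, and the fixed two-second wait and the XPath are untested assumptions about the live page.

```csharp
using System;
using System.Threading;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class Scraper
{
    static void Main()
    {
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://www.espn.com/college-football/");
            var js = (IJavaScriptExecutor)driver;
            long lastHeight = 0;

            // Keep scrolling until the document height stops growing,
            // i.e. the infinite scroll has loaded everything it is going to load.
            while (true)
            {
                js.ExecuteScript("window.scrollTo(0, document.body.scrollHeight);");
                Thread.Sleep(2000); // crude wait for the next batch of articles (assumption)
                long newHeight = (long)js.ExecuteScript("return document.body.scrollHeight;");
                if (newHeight == lastHeight) break;
                lastHeight = newHeight;
            }

            // Same XPath as the HtmlAgilityPack version, now against the rendered page.
            var titles = driver.FindElements(By.XPath("//*[@id='news-feed']/article//section//h1"));
            foreach (var title in titles)
            {
                Console.WriteLine($"Title: {title.Text}");
            }
        }
    }
}
```

A WebDriverWait polling for new articles would be more robust than the fixed Thread.Sleep, but the sleep keeps the sketch short.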

0 Answers