
I'm trying to get the table with id `table-matches` available here. The problem is that the table is loaded via ajax, so I don't get the full HTML when I download the page:

string url = "http://www.oddsportal.com/matches/soccer/20180701/";

using (HttpClient client = new HttpClient())
{
    using (HttpResponseMessage response = client.GetAsync(url).Result)
    {
        using (HttpContent content = response.Content)
        {
            string result = content.ReadAsStringAsync().Result;
        }
    }
}
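For reference, the same download can be sketched with async/await instead of blocking on `.Result` (which can deadlock in UI contexts). This doesn't bring the table back, it only shows the non-blocking form of the request:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

class Downloader
{
    // Sketch: same request as above, but awaited end-to-end so no thread
    // blocks on .Result. The table will still be missing, because the
    // server only delivers it to JavaScript-driven ajax calls.
    public static async Task<string> DownloadPageAsync(string url)
    {
        using (HttpClient client = new HttpClient())
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```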

The HTML returned does not contain any table. To rule out a problem with the library, I turned JavaScript off in Chrome (via the dev console, F12) and got the same result in the browser.

To fix this problem I thought of using a WebBrowser control, in particular:

webBrowser.Navigate("oddsportal.com/matches/soccer/20140221/"); 
HtmlElementCollection elements = webBrowser.Document.GetElementsByTagName("table");

but I want to ask whether I can also load the full HTML using asynchronous calls. Has anyone encountered a similar problem?

Could you please share a solution? Thanks.

Jidic

1 Answer


The main issue with this page is that the content inside `table-matches` is loaded via ajax, and neither HttpClient nor HtmlAgilityPack is able to wait for the ajax calls to finish. Therefore, you need a different approach.

Approach #1 - Use any headless browser like PuppeteerSharp

using PuppeteerSharp;
using System;
using System.Threading.Tasks;

namespace PuppeteerSharpDemo
{
    class Program
    {
        private static String url = "http://www.oddsportal.com/matches/soccer/20180701/";

        static void Main(string[] args)
        {
            var htmlAsTask = LoadAndWaitForSelector(url, "#table-matches .table-main");
            htmlAsTask.Wait();
            Console.WriteLine(htmlAsTask.Result);

            Console.ReadKey();
        }

        public static async Task<string> LoadAndWaitForSelector(String url, String selector)
        {
            using (Browser browser = await Puppeteer.LaunchAsync(new LaunchOptions
            {
                Headless = true,
                ExecutablePath = @"c:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            }))
            using (Page page = await browser.NewPageAsync())
            {
                await page.GoToAsync(url);
                await page.WaitForSelectorAsync(selector);
                return await page.GetContentAsync();
            }
    }
}

For the sake of cleanness, I've posted the output here. And once you have the HTML content, you are able to parse it with HtmlAgilityPack.
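A minimal sketch of that parsing step, assuming the HTML string returned by `LoadAndWaitForSelector` and the HtmlAgilityPack NuGet package; the XPath here is an assumption about the page's markup, not something verified in this answer:

```csharp
using System;
using HtmlAgilityPack;

class MatchParser
{
    // Sketch: parse the rendered HTML (e.g. the string returned by
    // LoadAndWaitForSelector) and print the text of the match cells.
    // The XPath below is a guess at the page's structure.
    public static void PrintMatches(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = doc.DocumentNode.SelectNodes(
            "//table[@id='table-matches']//td[contains(@class, 'name')]");
        if (cells == null) return; // selector matched nothing

        foreach (var cell in cells)
            Console.WriteLine(cell.InnerText.Trim());
    }
}
```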

Approach #2 - Use pure Selenium WebDriver. It can also be launched in headless mode.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;
using System;

namespace SeleniumDemo
{
    class Program
    {
        private static IWebDriver webDriver;
        private static TimeSpan defaultWait = TimeSpan.FromSeconds(10);
        private static String targetUrl = "http://www.oddsportal.com/matches/soccer/20180701/";
        private static String driversDir = @"../../Drivers/";

        static void Main(string[] args)
        {
            webDriver = new ChromeDriver(driversDir);
            webDriver.Navigate().GoToUrl(targetUrl);
            IWebElement table = webDriver.FindElement(By.Id("table-matches"));
            var innerHtml = table.GetAttribute("innerHTML");
            webDriver.Quit();
        }

        #region (!) I didn't even use this, but it can be useful (!)
        public static IWebElement FindElement(By by)
        {
            try
            {
                WaitForAjax();
                var wait = new WebDriverWait(webDriver, defaultWait);
                return wait.Until(driver => driver.FindElement(by));
            }
            catch
            {
                return null;
            }
        }

        public static void WaitForAjax()
        {
            var wait = new WebDriverWait(webDriver, defaultWait);
            wait.Until(d => (bool)(d as IJavaScriptExecutor).ExecuteScript("return jQuery.active == 0"));
        }
        #endregion
    }
}
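To actually run the driver headless, as mentioned above, you can pass `ChromeOptions` when constructing it. A sketch, reusing the drivers directory from the example:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class HeadlessLauncher
{
    // Sketch: launch Chrome without a visible window. The --headless
    // switch is passed straight through to the Chrome binary.
    public static IWebDriver Launch()
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        options.AddArgument("--disable-gpu"); // recommended on some platforms

        return new ChromeDriver(@"../../Drivers/", options);
    }
}
```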

Approach #3 - Simulate ajax requests

If you analyse the page loading using Fiddler or browser's profiler (F12) you can see that all data is coming with these two requests:

(Screenshot: Fiddler capture showing the two ajax requests.)

So you can try to execute them directly using HttpClient. But in this case you may need to track authorization headers, and maybe something else, with each HTTP request.
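A sketch of what that could look like. The endpoint URL below is a placeholder, and the headers are assumptions; the real values have to be copied from Fiddler or the browser's network tab:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class AjaxCaller
{
    // Sketch: call the ajax endpoint directly, bypassing the browser.
    // The URL is a placeholder -- copy the real request URL, and any
    // headers the server checks, from Fiddler or the network profiler.
    public static async Task<string> FetchAjaxAsync()
    {
        using (var client = new HttpClient())
        {
            // Such endpoints often inspect these headers before answering.
            client.DefaultRequestHeaders.Referrer =
                new Uri("http://www.oddsportal.com/matches/soccer/20180701/");
            client.DefaultRequestHeaders.Add("X-Requested-With", "XMLHttpRequest");

            string ajaxUrl = "http://www.oddsportal.com/..."; // placeholder
            return await client.GetStringAsync(ajaxUrl);
        }
    }
}
```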

Andrey Kotov
  • Thanks for this amazing answer, I'll try the first solution. I'm interested in the third solution too, because it calls the API directly without installing any addons; could you help me achieve that? This is my email: marioserdama@gmail.com – Jidic Jul 22 '18 at 10:51
  • I suggest you use the first approach, as it is easy to use and also doesn't require additional addons to be installed (you only need to add the PuppeteerSharp package via NuGet). As I've mentioned, direct HTTP calls to ajax endpoints might be tricky. – Andrey Kotov Jul 22 '18 at 12:00
  • The problem with the first two approaches is that the dependencies aren't fully compatible with .NET Core; that's why I cannot use them – Jidic Jul 22 '18 at 12:54
  • Puppeteer-Sharp is a multi-platform .NET standard 2.0 library. That means it can be used on any .NET Runtime compatible with 2.0, .NET Framework 4.6.1+ or .NET Core 2.0+ – Andrey Kotov Jul 22 '18 at 13:24
  • I saw it, but it still needs chrome.exe (as your code does). I replied to the email; can we communicate by email? – Jidic Jul 22 '18 at 13:25
  • If you are on Linux then [**BrowserFetcher**](https://github.com/kblok/puppeteer-sharp/blob/master/lib/PuppeteerSharp/BrowserFetcher.cs) will help you download and run Chrome. – Andrey Kotov Jul 22 '18 at 15:17