Fetching data from a web page to a C# application

Question

I am trying to create a desktop application in C# that will retrieve data from a website. In short, this is an application that I will use to create statistics for my local league's fantasy football (soccer) game. All the data I want to use is freely available online, but there are no APIs available to retrieve the data.

The first thing I tried was to get the HTML code for the website using WebClient and DownloadString:

WebClient client = new WebClient();
string priceChangeString = client.DownloadString(url);

However, it turned out that the data is not in the HTML string.

If I use Developer Tools in Chrome I can inspect the page under "Elements". There I can see the data I want:

Screenshot from Chrome Developer Tools

I have tried to get these values by using "Copy XPath" and HtmlAgilityPack, but I can't get it to work. My code:

using HtmlAgilityPack;

string url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);

string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;

I have tried several variations of this code, but they all throw a NullReferenceException:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.

at FantasyTest.Program.Main(String[] args) in C:\Users\my_username\source\repos\FantasyTest\FantasyTest\Program.cs:line 27

Does anyone see what I'm doing wrong when I try to use HtmlAgilityPack and XPath? Are there any other approaches I can take to solve this?

The web page from this example can be found here

Relying on the div structure and order sounds like a very bad idea. Try to find some IDs or class names which uniquely identify your div. — Yeldar Kurmangaliyev, May 05 '19 at 10:20
Most likely the data is generated by JavaScript, and an HTTP call from C# is not going to execute the JavaScript. Maybe [this will be of use](https://stackoverflow.com/questions/24288726/scraping-webpage-generated-by-javascript-with-c-sharp) — Crowcoder, May 05 '19 at 11:40
Thank you for the response. Unfortunately it looks like the technique used in that example (using the PhantomJS driver in my cs-file) no longer works: https://stackoverflow.com/questions/52442100/selenium-phantomjs-is-invalid-namespace — Snorre, May 05 '19 at 12:44
A lot of web sites like the one you're referencing have web services you can call (possibly/probably for a fee) which will return the specific data you want in a digestible format. You might investigate finding just such a service. Maybe the site itself gets the data from such a provider. — Clay, May 05 '19 at 22:02

score 1 · Answer 1 · edited May 05 '19 at 12:24

I used a list to store all the matching nodes and then searched through that list for a given tag, for example <span>. Within those <span> elements, the application looks for class="card-list".

using System;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;

var url = "https://fantasy.eliteserien.no/a/statistics/cost_change_start";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);

// This is the part of the code that takes information from the website.
// It matches your screenshot: the HTML has a table with class="ism-table ism-table--el".
// This piece of code targets that specific table.
var ProductsHtml = htmlDocument.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("class", "")
        .Equals("ism-table ism-table--el"))
    .ToList();

try
{
    var ProductListItems = ProductsHtml[0].Descendants("tr");
    foreach (var ProductListItem in ProductListItems)
    {
        // This targets what is inside each table row
        Console.WriteLine("Id: " +
            ProductListItem.Descendants("<HEADER>")
                .Where(node => node.GetAttributeValue("<CLASS>", "")
                    .Equals("<CLASS=>"))
                .FirstOrDefault()?.InnerText);
    }
}
catch (ArgumentOutOfRangeException)
{
    // ProductsHtml is empty: the downloaded HTML contains no matching table
    Console.WriteLine("No matching table found.");
}

In your case I think you need a regex to match the numbers. This site has the numbers in <td>number</td> format, whereas what we need is <td class="mNOK">number</td>, so you need a regex to match all the numbers. To do that:

// Regex: match the first run of digits in the first <td>
Console.WriteLine("numbers: " +
    Regex.Match(ProductListItem.Descendants("td").FirstOrDefault().InnerText,
        @"[0-9]+")
);

Note that you need to change <URL>, <HEADER>, <CLASS> and <CLASS=>.

<URL>: the site you want to take information from.
<HEADER>: the HTML tag you want to read, for example span, div, li or ul.
<CLASS>: the attribute of that tag to look at, for example id or name.
<CLASS=>: the value the <CLASS> attribute must equal for the inner text to be read.
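As a rough illustration of the regex step, here is a minimal sketch that pulls the first number out of a cell's inner text. It assumes each cell holds one number, optionally signed and with a decimal point; the `CellNumbers` helper and the sample strings are made up for this example and are not taken from the site.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class CellNumbers
{
    // Optional minus sign, digits, optional decimal part: e.g. "5", "-0.3", "12.5"
    private static readonly Regex NumberPattern = new Regex(@"-?[0-9]+(\.[0-9]+)?");

    // Returns the first number found in the text, or null when there is none.
    public static decimal? FirstNumber(string cellText)
    {
        if (string.IsNullOrEmpty(cellText)) return null;

        Match match = NumberPattern.Match(cellText);
        if (!match.Success) return null;

        // Parse with the invariant culture so "." is always the decimal separator
        return decimal.Parse(match.Value, CultureInfo.InvariantCulture);
    }
}
```

Called as `CellNumbers.FirstNumber(ProductListItem.Descendants("td").FirstOrDefault()?.InnerText)`, it returns null instead of throwing when the row has no cell or the cell has no number.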

Thank you for the response. Unfortunately, this code has a count of 0 after it is run: var ProductsHtml = htmlDocument.DocumentNode.Descendants("table") .Where(node => node.GetAttributeValue("class", "") .Equals("ism-table ism-table--el")).ToList(); Should this method be able to retrieve the data even though it is not present directly in the htmlDocument object? — Snorre, May 05 '19 at 11:55
Add a breakpoint after `var ProductsHtml = htmlDocument.DocumentNode.Descendants("table") .Where(node => node.GetAttributeValue("class", "") .Equals("ism-table ism-table--el")).ToList();` As seen [here](https://ibb.co/d7J3Q5d). — SablyTv, May 05 '19 at 12:56
A breakpoint is placed there. It shows that the ProductsHtml list contains 0 elements. It also seems that nothing gets returned from htmlDocument.DocumentNode.Descendants("table") (looking at the object returned by HtmlAgilityPack.HtmlNode.Descendants in the debugger). Might this be related to the comments under the question, which say that the data may be generated by JavaScript? — Snorre, May 05 '19 at 15:02

score 0 · Answer 2 · answered May 05 '19 at 21:18

If you don’t mind calling an external Python program, I’d suggest looking at Python and the library called “BeautifulSoup”. It parses HTML nicely. Have the Python program write out an XML file that your application can deserialize; the C# program can then do whatever it needs to do using that deserialized structure.
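On the C# side, the deserialization step could look roughly like the sketch below. The XML layout (`Players`, `Player`, `Name`, `Price`) and the class names are invented for illustration; the answer does not specify a file format, so adapt them to whatever the external program actually writes.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical shape of one exported row; element names are assumptions.
public class PlayerRow
{
    public string Name { get; set; }
    public decimal Price { get; set; }
}

[XmlRoot("Players")]
public class PlayerList
{
    [XmlElement("Player")]
    public List<PlayerRow> Rows { get; set; } = new List<PlayerRow>();
}

public static class PlayerXml
{
    // Deserializes XML of the form
    // <Players><Player><Name>..</Name><Price>..</Price></Player>...</Players>
    public static PlayerList Parse(string xml)
    {
        var serializer = new XmlSerializer(typeof(PlayerList));
        using (var reader = new StringReader(xml))
        {
            return (PlayerList)serializer.Deserialize(reader);
        }
    }
}
```

To read from disk, pass `File.ReadAllText(path)` to `PlayerXml.Parse`.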

Snorre · Accepted Answer · 2019-05-07T15:17:29.260

Thank you all for the feedback on this post, it has helped me find a solution to this problem.

It turned out that the data I wanted to retrieve was loaded with JavaScript. HtmlWeb and HtmlDocument from HtmlAgilityPack only load the HTML that the server returns and do not run the page's scripts, so they see the page before the data I want has been added to it. They can therefore not be used for this purpose.
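This also explains the NullReferenceException in the question: HtmlAgilityPack's SelectNodes returns null when nothing matches the XPath, so indexing the result with [0] fails. A minimal sketch of a null-safe lookup follows; the `StaticHtmlProbe` helper is a hypothetical name used only for this example.

```csharp
using HtmlAgilityPack;

public static class StaticHtmlProbe
{
    // Returns the inner text of the first node matching the XPath,
    // or null when the static HTML contains no such node.
    public static string TryGetText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node?.InnerText;
    }
}
```

If this returns null for HTML fetched with WebClient or HttpClient while the element is visible in Chrome's Developer Tools, the element is most likely added by a script after the page loads.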

I got around this by using a headless browser. I installed ChromeDriver and Selenium via NuGet, and got the data I wanted by using the following code:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Run Chrome without a visible window
var chromeOptions = new ChromeOptions();
chromeOptions.AddArguments("headless");

using (var driver = new ChromeDriver(chromeOptions))
{
    driver.Navigate().GoToUrl("https://fantasy.eliteserien.no/a/statistics/cost_change_start");

    // As IWebElement
    var fantasyTable = driver.FindElement(By.ClassName("ism-scroll-table"));

    // Content as text-string
    string fantasyTableText = fantasyTable.Text;

    // As Html-string
    string fantasyTableAsHtml = fantasyTable.GetAttribute("innerHTML");

    // My code for handling the data follows here...
}

Resource used to solve this: How to start ChromeDriver in headless mode
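Because the table is filled in by a script, it may not exist yet at the moment the page finishes loading. One possible refinement, sketched below, is to poll with Selenium's WebDriverWait until the element is present. The `FantasyTableFetcher` class and its parameters are hypothetical names for this example, and depending on the Selenium version WebDriverWait may require the separate Selenium.Support package.

```csharp
using System;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class FantasyTableFetcher
{
    // Opens the page in headless Chrome and returns the element's inner HTML
    // once an element with the given class name is present.
    public static string FetchTableHtml(string url, string className, TimeSpan timeout)
    {
        var options = new ChromeOptions();
        options.AddArgument("headless");

        using (var driver = new ChromeDriver(options))
        {
            driver.Navigate().GoToUrl(url);

            // Poll until the element exists; throws WebDriverTimeoutException
            // if it has not appeared when the timeout expires.
            var wait = new WebDriverWait(driver, timeout);
            IWebElement table = wait.Until(d =>
                d.FindElements(By.ClassName(className)).FirstOrDefault());

            return table.GetAttribute("innerHTML");
        }
    }
}
```

For the page in the question, the call would be `FantasyTableFetcher.FetchTableHtml("https://fantasy.eliteserien.no/a/statistics/cost_change_start", "ism-scroll-table", TimeSpan.FromSeconds(10))`, where the ten-second timeout is an arbitrary choice.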
