6

The web page uses JavaScript to build its HTML, so I need an HTML parser with JavaScript support.
I found AngleSharp, but I can't get it working.

using AngleSharp;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace AngleSharpScraping
{
    class Program
    {
        static void Main(string[] args)
        {
            GetMkvToolNix();
            Console.ReadKey();
        }

        static async void GetMkvToolNix()
        {
            // Create a new configuration with javascript interpreter.
            var config = new Configuration().WithJavaScript();

            // Parsing process.
            var document = await BrowsingContext.New(config).OpenAsync(Url.Create("http://www.fosshub.com/MKVToolNix.html"));
            var link = document.QuerySelector("body > div.container.page-content > div > div.col-sm-9 > article > div.main-dl-box > p:nth-child(2) > a.dwl-link.xlink").GetAttribute("data");

            Console.WriteLine(link);
        }
    }
}
Lucas Trzesniewski
baltazer

2 Answers

5

AngleSharp alone only provides an HTML and CSS parser. However, AngleSharp may be extended with JavaScript capabilities. Right now the package you've used (AngleSharp.Scripting.JavaScript) is experimental and more or less a proof of concept.

The JavaScript files on the page are still too complex for the experimental support. I am working to enable support for such scenarios as soon as possible, but right now I would say that WebKit.NET is probably your best shot for solving your problem.

Another possible solution might be to use the C# driver for Selenium.
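As a minimal sketch of the Selenium route, the snippet below loads the page in headless Chrome and reads the same attribute the question is after. It assumes the Selenium.WebDriver NuGet package plus a local Chrome and matching chromedriver; the shortened selector `a.dwl-link.xlink` is taken from the tail of the selector in the question and is an assumption about the page's current markup.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumSketch
{
    static void Main()
    {
        // Run Chrome without a visible window (assumes Chrome and chromedriver are installed).
        var options = new ChromeOptions();
        options.AddArgument("--headless");

        using (var driver = new ChromeDriver(options))
        {
            // The browser executes the page's scripts before we query the DOM.
            driver.Navigate().GoToUrl("http://www.fosshub.com/MKVToolNix.html");

            try
            {
                // Shortened form of the selector from the question (assumed to still match).
                var anchor = driver.FindElement(By.CssSelector("a.dwl-link.xlink"));
                Console.WriteLine(anchor.GetAttribute("data"));
            }
            catch (NoSuchElementException)
            {
                // The element may be missing or not yet created by the page's scripts.
                Console.WriteLine("Download link not found.");
            }
        }
    }
}
```

If the link is created late by the page's scripts, the lookup may need to be retried or preceded by an explicit wait.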

Unrelated to the whole JavaScript topic: if you want to load external resources, you need to provide a proper (HTTP) requester. The easiest way to do that is by using the default one:

var config = new Configuration().WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("http://www.fosshub.com/MKVToolNix.html");
// ...

In this setting external documents are loaded, but other resources (e.g., images, scripts, ...) are not loaded.

Florian Rappl
  • I had some weird problems with Selenium before, like browser errors or a firewall access dialog popping up. WebKit.NET seems unmaintained or dead. On NuGet I found CefSharp, but it looks overcomplicated. – baltazer Jun 08 '15 at 12:14
  • I see. Well, hang in there and I'll try to improve the support for JS. Only time is a limited resource here. – Florian Rappl Jun 08 '15 at 12:42
  • Did anyone get `Method 'EvaluateScriptAsync' in type 'AngleSharp.Scripting.JavaScript.JavaScriptEngine' from assembly 'AngleSharp.Scripting.JavaScript, Version=0.3.1.26954, Culture=neutral, PublicKeyToken=null' does not have an implementation.` ? by simply running `new Configuration().WithJavaScript()` ? Am getting the same thing using `Configuration.Default.WithJavaScript()` – Veverke Apr 07 '16 at 13:25
3

AngleSharp is a text parser. If you want to scrape dynamic web pages with JS, you'll need a headless browser.

This answer provides a couple of options (at least one free and open source: WebKit.NET).

Community
zlumer
  • AngleSharp executes JavaScript with Jint: `var config = new Configuration().WithJavaScript();` The BrowsingContext must act like a real browser, with session and cookie handling. – baltazer Jun 07 '15 at 18:48