Can't download HTML data from https URL using htmlagilitypack

Question

I have a "small" problem with HtmlAgilityPack (HAP). When I try to get data from a website, I get this error:

An unhandled exception of type 'System.ArgumentException' occurred in mscorlib.dll

Additional information: 'gzip' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.

I'm using this piece of code to get the data from the website:

HtmlWeb page = new HtmlWeb();
var url = "https://kat.cr/";
var data = page.Load(url);

After this code runs, I get that error. I tried everything I could find on Google, but nothing helped.

Can someone tell me how to resolve this problem?

Thank you

score 13 · Answer 1 · answered Jun 19 '16 at 19:21

You can intercept the request when using HtmlWeb to modify it based on your requirements.

using System.Net;
using HtmlAgilityPack;

var page = new HtmlWeb()
{
  PreRequest = request =>
  {
    // Make any changes to the request object that will be used.
    request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    return true;
  }
};

var url = "https://kat.cr/";
var data = page.Load(url);

score 9 · Accepted Answer · edited May 23 '17 at 12:17

HtmlWeb doesn't handle the gzip-compressed response this site sends by default (the problem is the compression, not https as such). One way around it is to use WebClient, with a small modification to decompress GZip automatically:

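using System;
using System.Net;

class MyWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        // Enable transparent gzip/deflate decompression; the cast is null for non-HTTP requests.
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
        }
        return request;
    }
}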

Then use HtmlDocument.LoadHtml() to populate your HtmlDocument instance from the HTML string:

var url = "https://kat.cr/";
var data = new MyWebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

answered Mar 25 '16 at 13:31

har07

Thank you for your help, it worked. Now I have one more question. I have XPaths like these: `//*[@id=\"torrent_age_of_ultron11227701\"]/td/div/div/a`, `//*[@id=\"torrent_age_of_ultron11227702\"]/td/div/div/a`, `//*[@id=\"torrent_age_of_ultron11227731\"]/td/div/div/a`, `//*[@id=\"torrent_age_of_ultron11227755\"]/td/div/div/a`, `//*[@id=\"torrent_age_of_ultron11227766\"]/td/div/div/a`, `//*[@id=\"torrent_age_of_ultron112277771\"]/td/div/div/a`. Is there any way to write a single XPath that matches all of them, such as `//*[@id=\"torrent_age_of_ultron(like a regex here)\"]/td/div/div/a`? – Valentin Pifu Mar 25 '16 at 14:59
@ValentinPifu XPath 1.0, which HtmlAgilityPack uses under the hood, doesn't support regex. Maybe the XPath `starts-with()` function is enough? Anyway, that is an entirely different topic from the original question, so I'd suggest posting another question for it if you can't find a solution. Thanks – har07 Mar 26 '16 at 01:40
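
For anyone landing here with the same follow-up, the `starts-with()` suggestion could look roughly like the sketch below. It assumes the rows have the structure implied by the XPaths in the comment above (an element whose id begins with a common prefix, containing `td/div/div/a`). The names `TorrentLinkFinder` and `FindLinkTexts` are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class TorrentLinkFinder
{
    // Hypothetical helper: returns the inner text of every <a> found under
    // an element whose id starts with idPrefix.
    public static List<string> FindLinkTexts(string html, string idPrefix)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // starts-with() is a standard XPath 1.0 function, so no regex is needed.
        // Assumption: idPrefix contains no single quote (it is not escaped here).
        string xpath = "//*[starts-with(@id, '" + idPrefix + "')]/td/div/div/a";

        var texts = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(xpath);
        if (links == null)
        {
            // SelectNodes returns null, not an empty collection, when nothing matches.
            return texts;
        }

        foreach (HtmlNode link in links)
        {
            texts.Add(link.InnerText);
        }
        return texts;
    }
}
```

With the `data` string downloaded in the accepted answer, a call such as `TorrentLinkFinder.FindLinkTexts(data, "torrent_age_of_ultron")` should collect the link texts for all matching rows in one pass. Note that `starts-with()` only matches a fixed prefix, so it is looser than a regex that also constrains the digits.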
