I'm using HtmlAgilityPack v1.11.21 and since upgrading to .NET Core 3.1, I started to receive the following error while trying to load up a web page via URL: 'UTF-8, text/html' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter 'name')

I found the post "'UTF8' is not a supported encoding name", but I'm not sure where or how I'm supposed to add the following:

    System.Text.EncodingProvider provider = System.Text.CodePagesEncodingProvider.Instance;
    Encoding.RegisterProvider(provider);

I tried placing it before calling

    var web = new HtmlWeb();
    var doc = web.Load(urlToSearch);

But that didn't solve the issue.

This was working fine before upgrading to .NET Core 3.1, so I'm not sure where exactly I need to implement a fix.

Any ideas would be appreciated!

Thanks!

For those asking for the URL, I'd rather not share it, but here's the page's head section:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <!-- Bootstrap -->
    <!-- Latest compiled and minified CSS -->
    <link rel="stylesheet" href="http://www.somesite.com/graphics/cdn/bootstrap-3.3.4-base-and-theme-min.2.css">
    <!-- Optional theme -->
    <link rel='stylesheet' type="text/css" media="screen" href="http://fonts.googleapis.com/css?family=Droid+Sans:400,700">
    <link rel="stylesheet" href="http://www.somesite.com/graphics/cdn/somesite-responsive.css">
    <link rel="apple-touch-icon-precomposed" sizes="57x57" href="/apple-touch-icon-57x57.png" />
    <link rel="apple-touch-icon-precomposed" sizes="114x114" href="/apple-touch-icon-114x114.png" />
    <link rel="apple-touch-icon-precomposed" sizes="72x72" href="/apple-touch-icon-72x72.png" />
    <link rel="apple-touch-icon-precomposed" sizes="144x144" href="/apple-touch-icon-144x144.png" />
    <link rel="apple-touch-icon-precomposed" sizes="60x60" href="/apple-touch-icon-60x60.png" />
    <link rel="apple-touch-icon-precomposed" sizes="120x120" href="/apple-touch-icon-120x120.png" />
    <link rel="apple-touch-icon-precomposed" sizes="76x76" href="/apple-touch-icon-76x76.png" />
    <link rel="apple-touch-icon-precomposed" sizes="152x152" href="/apple-touch-icon-152x152.png" />
    <link rel="icon" type="image/png" href="/favicon-196x196.png" sizes="196x196" />
    <link rel="icon" type="image/png" href="/favicon-96x96.png" sizes="96x96" />
    <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" />
    <link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16" />
    <link rel="icon" type="image/png" href="/favicon-128.png" sizes="128x128" />
    <meta name="application-name" content="&nbsp;" />
    <meta name="msapplication-TileColor" content="#FFFFFF" />
    <meta name="msapplication-TileImage" content="/mstile-144x144.png" />
    <meta name="msapplication-square70x70logo" content="/mstile-70x70.png" />
    <meta name="msapplication-square150x150logo" content="/mstile-150x150.png" />
    <meta name="msapplication-wide310x150logo" content="/mstile-310x150.png" />
    <meta name="msapplication-square310x310logo" content="/mstile-310x310.png" />
    <meta property="og:url" content="http://www.somesite.com/">
    <meta property="og:type" content="website">
    <meta property="og:title" content="site title">
    <meta property="og:image" content="http://www.somesite.com/graphics/somesite_square_logo.png">
    <meta property="og:description" content="description">
    <title>site title</title>
</head>
<body>
</body>
</html>

There doesn't seem to be anything special there. I was hoping it was a .NET Core 3.1 thing...

As another approach, I tried the code below, but response.Content.ReadAsStringAsync() comes back empty.

    // The using statement owns the block, so the client is disposed when the block ends.
    using (var httpClient = new HttpClient())
    {
        var response = await httpClient.GetAsync(urlToSearch);

        if (response.IsSuccessStatusCode)
        {
            var html = await response.Content.ReadAsStringAsync();

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var photoUrl = doc.QuerySelector("div #headshot").ChildNodes[0].Attributes["src"].Value;

            return new OkObjectResult(photoUrl);
        }
    }
AJ Tatum
  • Can you provide `urlToSearch` which caused the problem? I did some tests with a few randomly selected URLs and they all worked fine (using .NET Core 3.1 and HtmlAgilityPack 1.11.21). – DK Dhilip Mar 02 '20 at 20:32
  • The error is pretty clear - that string, with that casing, isn't a valid encoding – Panagiotis Kanavos Mar 03 '20 at 17:31
  • ASP.NET always used UTF8 encoding. StackOverflow is an ASP.NET site, recently migrated to ASP.NET Core, and *does* use UTF8, as you can verify simply by checking the encoding in your browser – Panagiotis Kanavos Mar 03 '20 at 17:33
  • The question you link to mentions another non-existent encoding - `UTF8` – Panagiotis Kanavos Mar 03 '20 at 17:34
  • 1) Does the code with `ReadAsStringAsync()` succeed on other pages (that is, up to the `QuerySelector` call, which is irrelevant here)? 2) What are the HTTP headers? 3) Can you find the `UTF-8, text/html` string anywhere? – x00 Mar 04 '20 at 10:35
  • Also: https://stackoverflow.com/questions/46994907/encoding-registerprovidercodepagesencodingprovider-instance-does-not-add-extra maybe cleanup/reinstallation and rebuild is needed? – x00 Mar 04 '20 at 10:37
  • I think the issue is due to the site being protected by Cloudflare upon looking at the cookies of the site and further investigation. – AJ Tatum Mar 05 '20 at 13:52
  • Doesn't explain `UTF-8, text/html` though – x00 Mar 05 '20 at 15:55

1 Answer

It looks like the issue is not with .NET Core 3.1 but with the URL you are trying to load.

  1. .NET Core 3.1 includes UTF-8 among its default encodings:

    .NET Core, on the other hand, supports only the following encodings:

    • [...]
    • UTF-8 (code page 65001), which is returned by the Encoding.UTF8 property.
    • [...]
  2. I don't recall any place in HTTP Headers or in HTML where a string similar to

    UTF-8, text/html

    is expected.

    In headers it looks like:

    Content-Type: text/html;charset=utf-8
    

    In HTML, like:

    <meta charset="utf-8"/>
    

    or

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    
  3. The page itself may show no signs of a problem in browsers, because they are quite forgiving. Your code before the upgrade could also have ignored the ", text/html" part for any number of reasons, and the issue could have started to appear after the upgrade for just as many.
  4. If you do not control the server, then you should probably load the page manually, remove the erroneous part (", text/html") from the string, and feed the result to HtmlAgilityPack.
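
One possible reading of point 4, as a rough sketch rather than a confirmed fix: download the body as raw bytes and decode it as UTF-8 yourself, so the charset the server declares should never have to be parsed. The class and method names (`ManualPageLoader`, `DecodeUtf8`, `FetchHtmlAsync`) are made up for illustration, and the sketch assumes the page really is UTF-8, as its `<meta charset="utf-8">` suggests.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, not part of HtmlAgilityPack or of the question's code.
public static class ManualPageLoader
{
    // Decode the raw body as UTF-8, ignoring whatever charset the server declared.
    // Assumption: the page is actually UTF-8.
    public static string DecodeUtf8(byte[] body) => Encoding.UTF8.GetString(body);

    // Fetch the page as bytes; ReadAsByteArrayAsync returns the body undecoded,
    // so a malformed charset in the headers should not matter here.
    public static async Task<string> FetchHtmlAsync(HttpClient client, string url)
    {
        using var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        byte[] body = await response.Content.ReadAsByteArrayAsync();
        return DecodeUtf8(body);
    }
}
```

The returned string can then be passed to `doc.LoadHtml(html)` as in the question's second snippet, instead of calling `web.Load(urlToSearch)`.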

Update

Considering your update:

  1. HTTP headers are also important, even more so: they take precedence over <meta>. Try
    curl -v yourURL
    
  2. Not sure about ReadAsStringAsync returning an empty string: maybe it's the same issue (wrong headers), or it may be an error in your code (as far as I know, ReadAsStringAsync doesn't return a string directly but a Task<string> that has to be awaited). To isolate the initial issue, you can try passing the HTML as a static string:
    html = "<!DOCTYPE html>...";
    doc.LoadHtml(html);
    
  3. As for ReadAsStringAsync, you should first check whether it succeeds in reading other sites. I searched around and there are a lot of possible causes; I don't know which fix will work for you.
  4. If the issue is with the headers, then you can try "Is it possible to make HttpClient ignore invalid ETag header in response?", or https://stackoverflow.com/a/29426046/12610347, or "How to make HttpClient ignore Content-Length header", or something else to your liking.
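
As a C# counterpart to the `curl -v` check in item 1, here is a minimal sketch that lists every response header whose value contains a given substring, for example "text/html". `HeaderInspector`, `FindHeaders` and `InspectAsync` are hypothetical names. It also assumes that HttpClient still exposes a malformed header value as a raw string when the headers are enumerated, which I have not verified against the site in question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical diagnostic helper for locating a suspicious header value.
public static class HeaderInspector
{
    // Return "Name: value" lines for every header value that contains the needle
    // (case-insensitive).
    public static List<string> FindHeaders(
        IEnumerable<KeyValuePair<string, IEnumerable<string>>> headers, string needle)
    {
        return headers
            .SelectMany(h => h.Value
                .Where(v => v.Contains(needle, StringComparison.OrdinalIgnoreCase))
                .Select(v => $"{h.Key}: {v}"))
            .ToList();
    }

    // Request the URL, stop once the headers have arrived, and search both the
    // general response headers and the content headers.
    public static async Task<List<string>> InspectAsync(
        HttpClient client, string url, string needle)
    {
        using var response = await client.GetAsync(
            url, HttpCompletionOption.ResponseHeadersRead);
        var hits = FindHeaders(response.Headers, needle);
        hits.AddRange(FindHeaders(response.Content.Headers, needle));
        return hits;
    }
}
```

If something like `UTF-8, text/html` turns up in one of the returned lines, that would point at the server's headers rather than at the HTML or at .NET Core itself.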
x00