C# WebClient - DownloadString bad encoding

Question

I'm trying to download an HTML document from Amazon, but for some reason I get a badly encoded string like "��K��g��g�e".

Here's the code I tried:

using (var webClient = new System.Net.WebClient())
{
    var url = "https://www.amazon.com/dp/B07H256MBK/";
    webClient.Encoding = Encoding.UTF8;
    var result = webClient.DownloadString(url);
}

Same thing happens when using HttpClient:

var url = "https://www.amazon.com/dp/B07H256MBK/";
var httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(url);

I also tried reading the result as bytes and then converting it to UTF-8, but I still get the same result. Also note that this DOES NOT always happen. For example, yesterday I ran this code for ~2 hours and got a correctly encoded HTML document. Today, however, I always get a badly encoded result. It happens every other day, so it's not a one-time thing.

==================================================================

However, when I use HtmlAgilityPack's wrapper, it works as expected every time:

var url = "https://www.amazon.com/dp/B07H256MBK/";
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(url);

What causes WebClient and HttpClient to return a badly encoded string even when I explicitly set the correct encoding? And how does HtmlAgilityPack's wrapper work by default?

Thanks for any help!

Check the "Content-Type" header, it contains the used encoding. Do not take UTF8 for granted. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type — jgauffin, Mar 19 '20 at 09:15

canton7 · Accepted Answer · 2020-03-19T09:22:04.137

I fired up Firefox's web dev tools, requested that page, and looked at the response headers. They include content-encoding: gzip, which means the response is gzip-encoded.

It turns out that Amazon gives you a response compressed with gzip even when you don't send an Accept-Encoding: gzip header (verified with another tool). This is a bit naughty, but not that uncommon, and easy to work around.

This wasn't a problem with character encodings at all. HttpClient is good at figuring out the correct encoding from the Content-Type header.
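If you want to confirm this from code rather than from the browser, one option is to download the raw bytes and look for the gzip magic number (0x1F 0x8B) at the start. The following is only a rough sketch: the ResponseBody class and its method names are made up for illustration, and it assumes you already have the body as a byte array (for example from WebClient.DownloadData).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper, not part of any library.
public static class ResponseBody
{
    // Gzip streams start with the two magic bytes 0x1F 0x8B (RFC 1952).
    public static bool LooksLikeGzip(byte[] data) =>
        data != null && data.Length >= 2 && data[0] == 0x1F && data[1] == 0x8B;

    // Decompresses the payload if it looks like gzip, then decodes it as text.
    // Payloads without the gzip signature are decoded as-is.
    public static string ToText(byte[] data, Encoding encoding)
    {
        if (!LooksLikeGzip(data))
            return encoding.GetString(data);

        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return encoding.GetString(output.ToArray());
        }
    }
}

// Possible usage with the code from the question:
//   byte[] raw = webClient.DownloadData(url);
//   string html = ResponseBody.ToText(raw, Encoding.UTF8);
```

If LooksLikeGzip returns true for the bytes you downloaded, the "garbage" is compressed data rather than text in the wrong character set.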

You can tell HttpClient to un-zip responses with:

HttpClientHandler handler = new HttpClientHandler()
{
    AutomaticDecompression = DecompressionMethods.GZip,
};

using (var client = new HttpClient(handler))
{
    // your code
}

This will be set automatically if you're using the NuGet package versions 4.1.0 to 4.3.2, otherwise you'll need to do it yourself.

You can do the same with WebClient, but it's harder.
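For reference, the usual way to do it with WebClient is to subclass it and switch on decompression on the underlying HttpWebRequest. This is a minimal sketch: GZipWebClient is a name I made up, and I've enabled Deflate alongside GZip as an extra that you can drop.

```csharp
using System;
using System.Net;

// Hypothetical subclass name; WebClient itself has no decompression setting.
public class GZipWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests support automatic decompression.
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }

        return request;
    }
}

// Possible usage, replacing WebClient in the question's code:
//   using (var webClient = new GZipWebClient())
//   {
//       var result = webClient.DownloadString(url);
//   }
```

Setting AutomaticDecompression this way also makes the request send a matching Accept-Encoding header, so the server is no longer compressing uninvited.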

Worked like a charm! Thanks a lot! — knewit, Mar 21 '20 at 04:59
