67

I have an issue with some content that we are downloading from the web for a screen scraping tool that I am building.

In the code below, the string returned by the WebClient.DownloadString method contains some odd characters in the downloaded source for a few (not all) web sites.

I have recently added HTTP headers as below. Previously the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header, and I don't know much about text encoding beyond the basics.

The characters, or character sequences that I refer to are:

"ï»¿"

and

"Â"

These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites 
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);
Konamiman
gbro3n
  • I'm getting the same error with a Windows application I wrote to minify my JS and CSS using the C# YUI Compressor. It throws errors on the files that come back with the exact chars you mention above. I specify `client.Encoding = Encoding.UTF8;` and it still returns funky chars... Also I'm trying to figure out how to handle the errors thrown by the C# YUI Compressor such as [ERROR] Invalid Syntax... – pixelbobby Feb 10 '11 at 19:47
  • It's been a while since I first came across this issue, and I have learned a fair bit about text encoding since. To help you out: basically, what you need to do is try to match the encoding from the HTTP headers that come with the response, then decode the byte stream using the detected encoding. If the encoding is not included in the headers, decode with UTF-8 and then look for an encoding declared in the HTML document. If there is none in the HTML document either, you are left with heuristics. I have read about various mechanisms, but there is no easy solution here. – gbro3n Nov 16 '11 at 21:28
  • I'll post some code back here next time I get the chance. – gbro3n Nov 16 '11 at 21:28
  • In my case the data returned was gzipped and had to be decompressed first, so I found this answer helpful: https://stackoverflow.com/a/34418228/74585 – Matthew Lock Feb 19 '16 at 01:20
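
The gzip case mentioned in the last comment can be sketched as follows. This is a minimal illustration, not code from the linked answer: `DecompressingWebClient` and `EnableDecompression` are hypothetical names. It relies on `HttpWebRequest.AutomaticDecompression`, which asks the framework to undo gzip/deflate before any text decoding happens.

```csharp
using System;
using System.Net;

// Hypothetical helper class; the name is an assumption made for this sketch.
public class DecompressingWebClient : WebClient
{
    // Turns on transparent gzip/deflate handling when the request is an HTTP request.
    // Other request types (e.g. file://) are returned unchanged.
    public static WebRequest EnableDecompression(WebRequest request)
    {
        var httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }

    // WebClient calls this for every download, so each request gets the setting.
    protected override WebRequest GetWebRequest(Uri address)
    {
        return EnableDecompression(base.GetWebRequest(address));
    }
}
```

With this in place, `new DecompressingWebClient().DownloadString(uri)` would be used instead of a plain `WebClient`. Compressed bodies decoded as text look like random garbage rather than the two specific sequences in the question, so this only helps when that is the symptom.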

5 Answers

104

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte-order mark, which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
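
To see the effect in isolation, here is a small sketch that decodes UTF-8 bytes with a single-byte encoding. `MojibakeDemo` and `MisreadUtf8` are names invented for this illustration. ISO-8859-1 stands in for windows-1252 because the two agree on every byte used here and ISO-8859-1 is available on all .NET runtimes without extra setup. The "Â" case is an assumption about the question's data: C2 is the lead byte of two-byte UTF-8 sequences such as the non-breaking space (C2 A0), which is a likely but unconfirmed source.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of any library.
public static class MojibakeDemo
{
    // Stand-in for windows-1252; identical for the bytes EF, BB, BF, C2 and A0.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Decodes bytes that are really UTF-8 as if they were a single-byte encoding,
    // which is the mistake described in the answer above.
    public static string MisreadUtf8(byte[] utf8Bytes)
    {
        return Latin1.GetString(utf8Bytes);
    }

    // The UTF-8 byte-order mark (EF BB BF) misread this way becomes "ï»¿".
    public static string MisreadBom()
    {
        return MisreadUtf8(Encoding.UTF8.GetPreamble());
    }

    // A UTF-8 non-breaking space (C2 A0) misread this way becomes "Â" plus U+00A0.
    public static string MisreadNonBreakingSpace()
    {
        return MisreadUtf8(Encoding.UTF8.GetBytes("\u00A0"));
    }
}
```

Decoding the same bytes with `Encoding.UTF8.GetString` instead gives back the original characters, which is why setting `WebClient.Encoding` fixes pages that really are UTF-8.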

dkarp
  • Thanks, although this produces problems on other websites. Now I see a diamond with a question mark in it. I guess I'm specifying an encoding in the HTTP header, so I should expect the same back from the web server? – gbro3n Jan 17 '11 at 18:51
  • Regardless of what you specify in the header, web servers can ignore it and return anything. You have to be prepared to deal with asking for UTF-8 and getting Windows encodings. – Dour High Arch Jan 17 '11 at 19:20
  • If you don't know which encoding the data will be coming back in, you can play it safe and get the raw bytes using [`WebClient.DownloadData`](http://msdn.microsoft.com/en-us/library/ms144188.aspx). – dkarp Jan 17 '11 at 19:27
  • dkarp - Wouldn't I still have to convert the byte stream into something intelligible using an encoding (which, as I understand it, there is no way to detect)? – gbro3n Jan 17 '11 at 19:43
  • I actually found a reference on the web to a bug in .NET 3.5, and it does seem to exist: the same site with the same code in .NET 4 doesn't produce the same character sequence. I have experimented with using WebRequest instead, which does produce different results, though I'm not sure they are necessarily better. – gbro3n Jan 17 '11 at 19:44
  • Images do not get loaded using this. I am simply rendering google.com, and the Google image shows as blank. Please help –  Oct 13 '11 at 12:51
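
On the question of what to do with the raw bytes from `DownloadData`: one partial answer, sketched here under assumptions, is to let `StreamReader` look for a byte-order mark and use a fallback encoding only when none is found. `RawBytesDecoder` is a hypothetical name, and the choice of fallback is left to the caller. This detects only BOM-marked content (UTF-8, UTF-16, UTF-32); it does not guess the encoding of unmarked text.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, written for illustration only.
public static class RawBytesDecoder
{
    // Decodes downloaded bytes, honouring a byte-order mark when one is present
    // and using the supplied fallback encoding otherwise.
    public static string Decode(byte[] rawData, Encoding fallback)
    {
        using (var stream = new MemoryStream(rawData))
        using (var reader = new StreamReader(
            stream, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // The BOM itself is consumed by the reader and not returned in the string.
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like `RawBytesDecoder.Decode(wc.DownloadData(uri), Encoding.GetEncoding("iso-8859-1"))`, where the fallback is whatever encoding you expect unmarked pages to use.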
51

The way WebClient.DownloadString is implemented is very dumb. It should get the character encoding from the Content-Type header in the response, but instead it expects the developer to tell the expected encoding beforehand. I don't know what the developers of this class were thinking.

I have created an auxiliary class that retrieves the encoding name from the Content-Type header of the response:

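// Usings required by this class and by the extension class further down
using System;
using System.Collections.Specialized;
using System.Linq;
using System.Net;
using System.Text;

public static class WebUtils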
{
    public static Encoding GetEncodingFrom(
        NameValueCollection responseHeaders,
        Encoding defaultEncoding = null)
    {
        if(responseHeaders == null)
            throw new ArgumentNullException("responseHeaders");

        //Note that key lookup is case-insensitive
        var contentType = responseHeaders["Content-Type"];
        if(contentType == null)
            return defaultEncoding;

        var contentTypeParts = contentType.Split(';');
        if(contentTypeParts.Length <= 1)
            return defaultEncoding;

        var charsetPart =
            contentTypeParts.Skip(1).FirstOrDefault(
                p => p.TrimStart().StartsWith("charset", StringComparison.InvariantCultureIgnoreCase));
        if(charsetPart == null)
            return defaultEncoding;

        var charsetPartParts = charsetPart.Split('=');
        if(charsetPartParts.Length != 2)
            return defaultEncoding;

        // Some servers send the value quoted, e.g. charset="utf-8"
        var charsetName = charsetPartParts[1].Trim().Trim('"');
        if(charsetName == "")
            return defaultEncoding;

        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch(ArgumentException ex) 
        {
            throw new UnknownEncodingException(
                charsetName,   
                "The server returned data in an unknown encoding: " + charsetName, 
                ex);
        }
    }
}

(UnknownEncodingException is a custom exception class; feel free to replace it with InvalidOperationException or whatever else you prefer.)

Then the following extension method for the WebClient class will do the trick:

public static class WebClientExtensions
{
    public static string DownloadStringAwareOfEncoding(this WebClient webClient, Uri uri)
    {
        var rawData = webClient.DownloadData(uri);
        var encoding = WebUtils.GetEncodingFrom(webClient.ResponseHeaders, Encoding.UTF8);
        return encoding.GetString(rawData);
    }
}

So in your example you would do:

urlData = wc.DownloadStringAwareOfEncoding(uri);

...and that's it.

Konamiman
  • After 4 years such a good answer? Man, just because of that you deserve my vote, nice effort. – Yaroslav May 05 '15 at 10:25
  • I believe this is not true. DownloadString does use the encoding from the Content-Type HTTP header; check out the source: http://referencesource.microsoft.com/#System/net/System/Net/webclient.cs,fd125940dd542ee8,references – Simon Mourier Sep 19 '15 at 10:23
  • According to the source, `DownloadString` tries to get the character encoding using the `Content-Type` header from the request, not the response. That's why Konamiman's extension works fine while `DownloadString` doesn't. – holdenmcgrohen Dec 14 '15 at 10:10
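
When the Content-Type header carries no charset, the approach gbro3n describes in the comments on the question is to look for a declaration inside the HTML itself. A rough sketch of that step is below. `HtmlCharsetSniffer` is a hypothetical name, the regular expression is a simplification rather than a full HTML parser, and the 4096-byte probe size is an arbitrary assumption.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, written for illustration only.
public static class HtmlCharsetSniffer
{
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the encoding declared in a <meta> tag near the start of the
    // document, or the fallback if none is declared or the name is unknown.
    public static Encoding Sniff(byte[] rawData, Encoding fallback)
    {
        // ISO-8859-1 maps each byte to exactly one char, so ASCII markup
        // survives intact whatever the real encoding is (UTF-16 excepted).
        string probe = Encoding.GetEncoding("iso-8859-1")
            .GetString(rawData, 0, Math.Min(rawData.Length, 4096));

        Match match = MetaCharset.Match(probe);
        if (!match.Success)
            return fallback;

        try
        {
            return Encoding.GetEncoding(match.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name on this runtime.
            return fallback;
        }
    }
}
```

It could be combined with the extension method above by passing `null` as the default to `WebUtils.GetEncodingFrom` and calling `HtmlCharsetSniffer.Sniff(rawData, Encoding.UTF8)` when that returns `null`.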
18
var client = new WebClient { Encoding = System.Text.Encoding.UTF8 };

var json = client.DownloadString(url);
Sanket Patel
1

None of these worked for me on some websites, such as "www.yahoo.com". The only thing that resolved my problem was changing DownloadString to OpenRead and adding a UserAgent header, as in the sample code. However, a few sites like "www.varzesh3.com" didn't work with any of these methods!

WebClient client = new WebClient();
// Send a User-Agent header (empty here, as in the original sample)
client.Headers.Add(HttpRequestHeader.UserAgent, "");
string s;
using (var stream = client.OpenRead("http://www.yahoo.com"))
using (var sr = new StreamReader(stream))
{
    s = sr.ReadToEnd();
}
Siamak Ferdos
0

In my case, I deleted every header related to language, charset, etc., except User-Agent and Cookie, and it worked.

 // Try commenting out these headers:
 //wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
 //wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");
bh_earth0