9

Using the following code, I can download the HTML of a file from the internet:

WebClient wc = new WebClient();

// ....

string downloadedFile = wc.DownloadString("http://www.myurl.com/");

However, sometimes the file contains "interesting" characters that come back garbled: é becomes Ã©, ↠ becomes â†  and フシギダネ becomes ãƒ•ã‚·ã‚®ãƒ€ãƒ.

I think it may have something to do with different Unicode encodings, as each character gets changed into two or more new ones, perhaps because each character is being split into its bytes, but I have very little knowledge in this area. What do you think is wrong?

Callum Rogers
  • 1
    The server likely returns a wrong encoding in the `Content-Type` header. – dtb Apr 23 '10 at 17:31
  • 4
    You should read [this article](http://www.joelonsoftware.com/articles/Unicode.html) to get some basic understanding on Unicode. It'll cover all the reasons why some items show up as two, for example. But importantly, it'll help you understand the basics you need to know about Unicode. – Grace Note Apr 23 '10 at 17:31
  • 1
    This is almost certainly UTF-8 HTML viewed as ISO-8859-1 or another single-byte encoding. – Pekka Apr 23 '10 at 17:35
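The diagnosis in the comments above (UTF-8 bytes read with a single-byte encoding) can be reproduced in a few lines. This is a minimal sketch, not the asker's code: `MojibakeDemo`, `Garble` and `DownloadAsUtf8` are hypothetical names, and `DownloadAsUtf8` assumes the page really is UTF-8, which the thread suggests but does not confirm.

```csharp
using System;
using System.Net;
using System.Text;

public static class MojibakeDemo
{
    // Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // Every UTF-8 byte becomes one character, so "é" (2 bytes) becomes 2 characters.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    // Hypothetical helper: fetch the raw bytes and decode them explicitly,
    // instead of letting DownloadString pick the encoding.
    // Assumption: the document at 'url' is UTF-8.
    public static string DownloadAsUtf8(string url)
    {
        using (WebClient wc = new WebClient())
        {
            byte[] raw = wc.DownloadData(url);
            return Encoding.UTF8.GetString(raw);
        }
    }

    public static void Main()
    {
        string garbled = Garble("\u00E9"); // U+00E9 is é
        Console.WriteLine(garbled.Length); // one character per UTF-8 byte
    }
}
```

Decoding the bytes yourself only helps if you know the real encoding; when it varies from site to site, it has to be detected from the response, which is what the accepted answer below does.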

3 Answers

48

Here's a wrapped download class which supports gzip and checks encoding header and meta tags in order to decode it correctly.

Instantiate the class, and call GetPage().

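using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class HttpDownloader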
{
    private readonly string _referer;
    private readonly string _userAgent;

    public Encoding Encoding { get; set; }
    public WebHeaderCollection Headers { get; set; }
    public Uri Url { get; set; }

    public HttpDownloader(string url, string referer, string userAgent)
    {
        Encoding = Encoding.GetEncoding("ISO-8859-1");
        Url = new Uri(url); // verify the uri
        _userAgent = userAgent;
        _referer = referer;
    }

    public string GetPage()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
        if (!string.IsNullOrEmpty(_referer))
            request.Referer = _referer;
        if (!string.IsNullOrEmpty(_userAgent))
            request.UserAgent = _userAgent;

        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Headers = response.Headers;
            Url = response.ResponseUri;
            return ProcessContent(response);
        }

    }

    private string ProcessContent(HttpWebResponse response)
    {
        SetEncodingFromHeader(response);

        Stream s = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            s = new GZipStream(s, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            s = new DeflateStream(s, CompressionMode.Decompress);  

        MemoryStream memStream = new MemoryStream();
        int bytesRead;
        byte[] buffer = new byte[0x1000];
        for (bytesRead = s.Read(buffer, 0, buffer.Length); bytesRead > 0; bytesRead = s.Read(buffer, 0, buffer.Length))
        {
            memStream.Write(buffer, 0, bytesRead);
        }
        s.Close();
        string html;
        memStream.Position = 0;
        using (StreamReader r = new StreamReader(memStream, Encoding))
        {
            html = r.ReadToEnd().Trim();
            html = CheckMetaCharSetAndReEncode(memStream, html);
        }            

        return html;
    }

    private void SetEncodingFromHeader(HttpWebResponse response)
    {
        string charset = null;
        if (string.IsNullOrEmpty(response.CharacterSet))
        {
            Match m = Regex.Match(response.ContentType, @";\s*charset\s*=\s*(?<charset>.*)", RegexOptions.IgnoreCase);
            if (m.Success)
            {
                charset = m.Groups["charset"].Value.Trim(new[] { '\'', '"' });
            }
        }
        else
        {
            charset = response.CharacterSet;
        }
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                Encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
            }
        }
    }

    private string CheckMetaCharSetAndReEncode(Stream memStream, string html)
    {
        Match m = new Regex(@"<meta\s+.*?charset\s*=\s*""?(?<charset>[A-Za-z0-9_-]+)""?", RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html);            
        if (m.Success)
        {
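            string charset = m.Groups["charset"].Value.ToLower();
            // If the meta tag was readable with a single-byte decoder, the page
            // cannot really be UTF-16, so treat such declarations as UTF-8.
            if ((charset == "unicode") || (charset == "utf-16"))
            {
                charset = "utf-8";
            }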

            try
            {
                Encoding metaEncoding = Encoding.GetEncoding(charset);
                if (Encoding != metaEncoding)
                {
                    memStream.Position = 0L;
                    StreamReader recodeReader = new StreamReader(memStream, metaEncoding);
                    html = recodeReader.ReadToEnd().Trim();
                    recodeReader.Close();
                }
            }
            catch (ArgumentException)
            {
            }
        }

        return html;
    }
}
Mikael Svenson
  • 1
    Something I wrote last year for an azure project :) Glad it could be of use for you. – Mikael Svenson Apr 23 '10 at 17:56
  • Thanks for sharing this, Mikael. I've used it and found a problem with the encoding detection. If the headers contain `charset`, it shouldn't check the meta tag, since the precedence rules clearly state that in case of conflict the header has the highest priority. http://goo.gl/5q0Yg – Diadistis Mar 20 '11 at 23:14
  • To solve this issue I've created an `encodingFoundInHeader` boolean field that gets set in `SetEncodingFromHeader` and, if true, prevents the call to `CheckMetaCharSetAndReEncode`. – Diadistis Mar 20 '11 at 23:18
  • That might be a good idea, but more often than not I have found the meta tags to be more correct than the headers. I wish this was easier and a 100% method :) – Mikael Svenson Mar 21 '11 at 07:33
  • Awesome answer, but know that (confusingly) HTTP ['deflate' isn't actually deflate (RFC 1951), but rather zlib (RFC 1950)](http://en.wikipedia.org/wiki/Gzip#Other_uses). Unless the .NET DeflateStream is exceptionally lenient, it won't correctly decompress `zlib` streams (on the other hand, the server might mistakenly send a raw deflate stream too!). I find it best to just not support HTTP deflate (as a client), just to avoid the ambiguity. – Cameron Dec 03 '11 at 20:52
  • Cameron, that's a good observation and something I haven't really tested. But at the time I used this code we were crawling tens of thousands of different sites and never got an error, as far as I remember. Deflate in .NET is gzip without a header, if I'm not mistaken. But as you say, it might be better to only support gzip in the header. – Mikael Svenson Dec 04 '11 at 18:37
  • Why do you check for an empty `response.CharacterSet` and then check for `"charset"` strings manually? The source-code to `HttpWebResponse.get_CharacterSet` does the same `"charset"` check. – Dai Mar 04 '18 at 05:20
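The header-versus-meta precedence debated in the comments above can be sketched as one small function. This is an illustration only, not part of Mikael's class: `CharsetPrecedence` and `Resolve` are hypothetical names, and the header-first order follows Diadistis's suggestion. If, like Mikael, you find meta tags more reliable in practice, swap the two entries in the array.

```csharp
using System;
using System.Text;

public static class CharsetPrecedence
{
    // Returns the encoding for the first charset name that .NET recognises,
    // trying the HTTP header value before the meta tag value.
    // Falls back to 'fallback' when neither name is usable.
    public static Encoding Resolve(string headerCharset, string metaCharset, Encoding fallback)
    {
        foreach (string name in new[] { headerCharset, metaCharset })
        {
            if (string.IsNullOrWhiteSpace(name))
                continue;
            try
            {
                return Encoding.GetEncoding(name.Trim());
            }
            catch (ArgumentException)
            {
                // Unknown charset name: try the next source.
            }
        }
        return fallback;
    }

    public static void Main()
    {
        Encoding chosen = Resolve("utf-8", "iso-8859-1", Encoding.ASCII);
        Console.WriteLine(chosen.WebName);
    }
}
```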
2

Since I am not allowed to comment (insufficient reputation), I'll have to post an additional answer. I am using Mikael's great class routinely, but I encountered a practical problem with the regex that tries to find the charset meta-info. This

Match m = new Regex(@"<meta\s+.*?charset\s*=\s*(?<charset>[A-Za-z0-9_-]+)", RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html); 

fails on this

<meta charset="UTF-8"/>

whereas this

Match m = new Regex(@"<meta\s+.*?charset\s*=\s*""?(?<charset>[A-Za-z0-9_-]+)""?", RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html);

does not.

Thanks, Mikael.
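The difference between the two patterns above can be checked directly. This is a minimal sketch: the class and member names (`MetaCharsetRegexDemo`, `Unquoted`, `OptionalQuotes`, `Find`) are made up for illustration, and the two regex strings are copied from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaCharsetRegexDemo
{
    private const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Pattern without quote handling: the charset value must follow '=' directly.
    public static readonly Regex Unquoted = new Regex(
        @"<meta\s+.*?charset\s*=\s*(?<charset>[A-Za-z0-9_-]+)", Options);

    // Pattern with an optional double quote on either side of the value.
    public static readonly Regex OptionalQuotes = new Regex(
        @"<meta\s+.*?charset\s*=\s*""?(?<charset>[A-Za-z0-9_-]+)""?", Options);

    // Returns the captured charset name, or null when the pattern does not match.
    public static string Find(Regex regex, string html)
    {
        Match m = regex.Match(html);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    public static void Main()
    {
        string html5 = "<meta charset=\"UTF-8\"/>";
        Console.WriteLine(Find(Unquoted, html5) ?? "(no match)");
        Console.WriteLine(Find(OptionalQuotes, html5) ?? "(no match)");
    }
}
```

Both patterns still match the older `http-equiv` form, where the charset sits unquoted inside the `content` attribute.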

johelmuth
-5

Try this

string downloadedFile = wc.DownloadString("http://www.myurl.com");

I always remove the trailing slash, and it has worked like a charm so far. But it could also just be chance.