
Here is a snippet of the code:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
WebRequest.DefaultWebProxy = null;//Ensure that we will not loop by going again in the proxy
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
string charSet = response.CharacterSet;
Encoding encoding;
if (String.IsNullOrEmpty(charSet))
    encoding = Encoding.Default;
else
    encoding = Encoding.GetEncoding(charSet);

StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
return resStream.ReadToEnd();

The problem is that when I test with http://www.google.fr, the "é" characters are not displayed correctly. I have tried changing ASCII to UTF8 and they are still displayed incorrectly. I have tested the HTML file in a browser, and the browser displays the text correctly, so I am pretty sure the problem is in the method I use to download the HTML file.

What should I change?


Update 1: Code and test file changed

Patrick Desjardins
  • "é" should still work, even in ASCII. Are you outputting to a file and determining that it's not working, or setting a breakpoint on the returned sb.ToString(), viewing it in Quick Watch and determining that it failed? – cfeduke Oct 22 '08 at 21:19
  • No, an acute accent would never work in ASCII, which only contains Unicode up to 127. – Jon Skeet Oct 22 '08 at 21:23
  • (Just in case anyone feels like contradicting that and talking about "extended ASCII" - see http://msdn.microsoft.com/en-us/library/system.text.encoding.ascii.aspx) – Jon Skeet Oct 22 '08 at 21:24
  • What about the zabulus answer here? It looks much simpler: http://stackoverflow.com/questions/7634113/is-it-possible-to-get-data-from-web-response-in-a-right-encoding – Ignacio Soler Garcia May 10 '12 at 09:31
  • It's pretty much what Jon answered 4 years ago :) – Patrick Desjardins May 10 '12 at 18:39
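
To illustrate the point made in the comments about ASCII, here is a minimal sketch of what happens when the UTF-8 bytes of "é" are decoded with the wrong encoding. The class and method names (AccentDemo, Misdecode) are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Text;

public static class AccentDemo
{
    // Hypothetical helper: encode text as UTF-8, then decode the same
    // bytes with a different (wrong) encoding.
    public static string Misdecode(string text, Encoding wrong)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return wrong.GetString(utf8Bytes);
    }
}
```

Calling AccentDemo.Misdecode("\u00E9", Encoding.ASCII) gives "??", because "é" is two bytes in UTF-8 and neither is below 128. With Encoding.GetEncoding("ISO-8859-1") the result is "Ã©", the kind of garbling described in the question.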

7 Answers


CharacterSet is "ISO-8859-1" by default if it is not specified in the server's Content-Type header (which is different from the "charset" meta tag in the HTML). I compare HttpWebResponse.CharacterSet with the charset attribute in the HTML. If they differ, I use the charset specified in the HTML to read the page again, this time with the correct encoding.

See the code:

    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response
    using (StreamReader sr = 
           new StreamReader(objResponse.GetResponseStream(), encoding))
    {
        strWebPage = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
    }

    // Check real charset meta-tag in HTML
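    int CharsetStart = strWebPage.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        // skip an opening quote, as in <meta charset="utf-8">
        if (CharsetStart < strWebPage.Length && strWebPage[CharsetStart] == '\"')
            CharsetStart++;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // left empty when no terminator is found, so the re-read below is skipped
        string RealCharset = CharsetEnd > CharsetStart
               ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart) : "";

        // real charset meta-tag in HTML differs from supplied server header?
        // (charset names are case-insensitive, e.g. "utf-8" and "UTF-8")
        if (RealCharset.Length > 0 &&
            !RealCharset.Equals(Charset, StringComparison.OrdinalIgnoreCase))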
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // read the web page again, but with correct encoding this time
            //   create request
            System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
            //   get response
            System.Net.HttpWebResponse objResponse2;
            objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();
            //   read response
            using (StreamReader sr = 
              new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
            {
                strWebPage = sr.ReadToEnd();
                // Close and clean up the StreamReader
                sr.Close();
            }
        }
    }
Alex Dubinsky
  • I think this should be marked as the answer. It actually gets the encoding from any web page and decodes it properly. The problem is that it does not work on Windows Phone, since its response implementation does not support Response.CharacterSet – Adarsha Jan 03 '12 at 01:45
  • Excellent! Exactly what I was looking for. I already had a loop to retry for unexpected errors, so I just needed to convert charset and realcharset to local variables to avoid the extra declarations of requests. – ThunderGr Oct 29 '13 at 10:49
  • Well, it's 2020 and this isn't true anymore. In fact, it's getting VERY complicated. For a full summary of this beast, check out [this answer](https://stackoverflow.com/a/44422343/656243). TL;DR: RFC 7231 now says there is no defined encoding, unless you are XML content, in which case it's us-ascii. But of course, there's more to it than that. – Lynn Crumbling Aug 27 '20 at 20:53

Firstly, the easier way of writing that code is to use a StreamReader and ReadToEnd:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
{
    using (Stream resStream = response.GetResponseStream())
    {
        StreamReader reader = new StreamReader(resStream, Encoding.???);
        return reader.ReadToEnd();
    }
}

Then it's "just" a matter of finding the right encoding. How did you create the file? If it's with Notepad then you probably want Encoding.Default - but that's obviously not portable, as it's the default encoding for your PC.

In a well-run web server, the response will indicate the encoding in its headers. Having said that, in some cases the response headers claim one thing and the HTML claims another.
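
As a rough sketch of how the two sources could be reconciled once the response body has been read into a byte array, the helper below looks for a charset declaration near the start of the HTML and falls back to the header value. The names CharsetSniffer, FindMetaCharset and Choose, the 4096-byte prefix and the regular expression are all assumptions made for this illustration. Preferring the HTML declaration over the header is a choice, not a rule; swapping the two candidates gives the header precedence.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Hypothetical helper: look for charset=... in the first bytes of the body.
    public static string FindMetaCharset(byte[] body)
    {
        // Encoding names are plain ASCII, so a Latin-1 view of the prefix
        // is enough to find them, whatever the real encoding is.
        int prefixLength = Math.Min(body.Length, 4096);
        string head = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, prefixLength);
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Try the charset declared in the HTML first, then the header value,
    // then the caller's fallback.
    public static Encoding Choose(string headerCharset, byte[] body, Encoding fallback)
    {
        foreach (string name in new[] { FindMetaCharset(body), headerCharset })
        {
            if (string.IsNullOrEmpty(name)) continue;
            try { return Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown name: try the next candidate */ }
        }
        return fallback;
    }
}
```

With the body in a byte array, CharsetSniffer.Choose(response.CharacterSet, bytes, Encoding.UTF8).GetString(bytes) would decode it once, without a second request. As noted in the comments below, real pages can still need heuristics beyond this.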

Jon Skeet
  • In fact, I am trying to get files from all over the world and I got some bad output (a PNG file wasn't properly formed) and badly rendered text (all characters like "é"). – Patrick Desjardins Oct 22 '08 at 21:25
  • If you're trying to read arbitrary HTML, you'll need to examine the headers and sometimes the start of the HTML (which can advertise the encoding just like XML does). Sometimes you then have to detect that it's probably not right and guess by heuristics anyway! – Jon Skeet Oct 22 '08 at 21:27
  • Ok, I'll take a look at the header. I have been playing with your code and StreamReader(resStream, true) doesn't work (it is supposed to find the encoding from the bytes...). I'll try to get it from the header. I'll post later. – Patrick Desjardins Oct 22 '08 at 21:30
  • Just curious if you've followed how complicated "determining what charset to use when the server doesn't supply one" has gotten over the years (see [my comment](https://stackoverflow.com/questions/227575/encoding-trouble-with-httpwebresponse#comment112507668_4229277) to Alex's answer.) – Lynn Crumbling Aug 27 '20 at 20:58
  • @LynnCrumbling: Nope, haven't really. – Jon Skeet Aug 28 '20 at 06:15

In case you don't want to download the page twice, I slightly modified Alex's code using the approach from "How do I put a WebResponse into a memory stream?". Here's the result:

public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);

    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();

        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }

    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();

    // Check real charset meta-tag in HTML
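    int CharsetStart = strWebPage.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        // skip an opening quote, as in <meta charset="utf-8">
        if (CharsetStart < strWebPage.Length && strWebPage[CharsetStart] == '\"')
            CharsetStart++;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // left empty when no terminator is found, so the re-read below is skipped
        string RealCharset = CharsetEnd > CharsetStart
               ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart) : "";

        // real charset meta-tag in HTML differs from supplied server header?
        // (charset names are case-insensitive, e.g. "utf-8" and "UTF-8")
        if (RealCharset.Length > 0 &&
            !RealCharset.Equals(Charset, StringComparison.OrdinalIgnoreCase))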
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);

            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);

            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }

    // dispose the first stream reader object
    sr.Close();

    return strWebPage;
}
Eddo
  • .NET 4 and later should have a Stream.CopyTo(Stream) method to simplify that. – Manny Jun 28 '12 at 07:35
  • Why do you have to set the buffer size to 1024? Is it not possible to read the whole stream in one go? And why 1024? Why don't you set it larger? – Hoy Cheung May 03 '13 at 11:57
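
The Stream.CopyTo suggestion from the comments could look roughly like the sketch below. It replaces the manual read loop and keeps the bytes in memory so they can be decoded more than once. The names StreamBuffering, BufferAll and Decode are hypothetical and do not come from the answer above.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamBuffering
{
    // Copy a forward-only stream (such as a response stream) into memory
    // so that it can be decoded more than once.
    public static MemoryStream BufferAll(Stream source)
    {
        var buffer = new MemoryStream();
        source.CopyTo(buffer);   // Stream.CopyTo exists since .NET 4
        buffer.Position = 0;     // rewind for the first read
        return buffer;
    }

    // Decode all buffered bytes with the given encoding. The stream stays
    // open, so this can be called again with a different encoding.
    public static string Decode(MemoryStream buffer, Encoding encoding)
    {
        return encoding.GetString(buffer.ToArray());
    }
}
```

In DownloadString this would mean calling BufferAll(responseStream) inside the using block, decoding once with the header's encoding, and decoding again only if the charset found in the HTML differs.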

There are some good solutions here, but they all seem to be trying to parse the charset out of the content type string. Here's a solution using System.Net.Mime.ContentType, which should be more reliable, and shorter.

var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var encoding = System.Text.Encoding.Default;
// the Content-Type header may be missing, and ContentType rejects null or empty input
var header = client.ResponseHeaders[HttpResponseHeader.ContentType];
if (!String.IsNullOrEmpty(header))
{
    var contentType = new System.Net.Mime.ContentType(header);
    if (!String.IsNullOrEmpty(contentType.CharSet))
    {
        encoding = System.Text.Encoding.GetEncoding(contentType.CharSet);
    }
}
string result = encoding.GetString(data);
stephenr85

This code downloads the page only once.

String FinalResult = "";
HttpWebRequest Request = (HttpWebRequest)System.Net.WebRequest.Create( URL );
HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
Stream ResponseStream = Response.GetResponseStream();
StreamReader Reader = new StreamReader( ResponseStream );

bool NeedEncodingCheck = true;

while( true )
{
    string NewLine = Reader.ReadLine(); // this may not work for compressed HTML.
    if( NewLine == null )
    {
        break;
    }

    FinalResult += NewLine;
    FinalResult += Environment.NewLine;

    if( NeedEncodingCheck )
    {
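        int Start = NewLine.IndexOf( "charset=" );
        if( Start > 0 )
        {
            Start += "charset=".Length;
            // skip an opening quote if present, as in <meta charset="utf-8">
            if( Start < NewLine.Length && NewLine[Start] == '\"' )
            {
                Start++;
            }
            int End = NewLine.IndexOfAny( new[] { ' ', '\"', ';' }, Start );

            if( End > Start )
            {
                // Replace Reader with one using the new encoding. Caveat: the old
                // reader may already have buffered bytes beyond this line.
                Reader = new StreamReader( ResponseStream, Encoding.GetEncoding(
                    NewLine.Substring( Start, End - Start ) ) );
            }

            NeedEncodingCheck = false;
        }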
    }
}

Reader.Close();
Response.Close();
KinBread

I studied the same problem with the help of Wireshark, a great protocol analyser. I think there are some design shortcomings in the HttpWebResponse class. In fact, the whole message entity is downloaded the first time you invoke the GetResponse() method of the HttpWebRequest class, but the framework has no place to hold the data in the HttpWebResponse class or anywhere else, so you have to get the response stream a second time.


There are still some problems when requesting the web page "www.google.fr" with a WebRequest.

I checked the raw request and response with Fiddler. The problem comes from Google's servers. The response HTTP headers are set to charset=ISO-8859-1, the text itself is encoded with ISO-8859-1, while the HTML says charset=UTF-8. This is inconsistent and leads to encoding errors.

After many tests, I managed to find a workaround. Just add:

myHttpWebRequest.UserAgent = "Mozilla/5.0";

to your code, and Google's response will magically become UTF-8 throughout.

Etienne Coumont