15

I am writing a program to download HTML pages from other websites. For some particular websites I cannot get the full HTML; I only get partial content. The servers with this problem send their data with "Transfer-Encoding: chunked", and I suspect this is the cause.

This is the header information returned by the server:

Transfer-Encoding: chunked
Connection: keep-alive
Pragma: no-cache
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Type: text/html; charset=UTF-8
Date: Sun, 11 Sep 2011 09:46:23 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Server: nginx/1.0.6

Here is my code:

HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response;
CookieContainer cookie = new CookieContainer();
request.CookieContainer = cookie;
request.AllowAutoRedirect = true;
request.KeepAlive = true;
request.UserAgent =
    @"Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2 FirePHP/0.6";
request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
string html = string.Empty;
response = request.GetResponse() as HttpWebResponse;

using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    html = reader.ReadToEnd();
}

I can only get partial HTML (I think it is the first chunk from the server). Could anyone help? Is there a solution?

Thanks!

svick
syking
  • HttpWebResponse already knows how to deal with chunked data. What you cannot ignore is the ContentEncoding. You assume utf8 in your StreamReader constructor call, this will go wrong when it is not. – Hans Passant Sep 11 '11 at 12:55
  • Hi Hans Passant, thanks for your comment. I confirmed that the web page uses UTF-8 encoding. I tried changing the encoding in the StreamReader to ASCII and got the same result; with Unicode, all of the content becomes unreadable. – syking Sep 12 '11 at 04:14
  • @HansPassant I apparently have the same problem, but passing the encoding to the StreamReader constructor doesn't seem to help. I also tried copying the ResponseStream to a MemoryStream and creating a StreamReader for every possible encoding, and none of them was able to dump all the chunks. Any idea? –  May 02 '14 at 10:30
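
As a minimal sketch of the encoding point raised in the comments: assuming the concern is the character set of the body, the value of HttpWebResponse.CharacterSet could be mapped to an Encoding instead of relying on the StreamReader default. The class name CharsetResolver and the method Resolve are hypothetical helpers, not part of the original code.

```csharp
using System;
using System.Text;

public static class CharsetResolver
{
    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, using the fallback
    // when the name is missing or not recognized.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // Some servers quote the charset value, so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws for names it does not know.
            return fallback;
        }
    }
}
```

It might then be used as new StreamReader(response.GetResponseStream(), CharsetResolver.Resolve(response.CharacterSet, Encoding.UTF8)).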

2 Answers

9

You can't use ReadToEnd to read chunked data. You need to read directly from the response stream using Stream.Read and decode the bytes yourself.

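StringBuilder sb = new StringBuilder();
byte[] buf = new byte[8192];
int count; // number of bytes returned by each Read call

using (Stream resStream = response.GetResponseStream())
{
    do
    {
        // Read returns 0 only when the end of the stream has been reached
        count = resStream.Read(buf, 0, buf.Length);
        if (count != 0)
        {
            // just hardcoding UTF8 here; a multi-byte character split
            // across two reads will not decode correctly this way
            sb.Append(Encoding.UTF8.GetString(buf, 0, count));
        }
    } while (count > 0);
}
string html = sb.ToString();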
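
One caveat with decoding each buffer separately is that a multi-byte UTF-8 character can straddle two reads and come out garbled. A possible variation, sketched here under the assumption that the encoding is known, uses a System.Text.Decoder, which carries incomplete byte sequences over to the next call. The class name ResponseText, the method ReadAll, and the bufferSize parameter are hypothetical and only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseText
{
    // Hypothetical helper: reads the stream to its end and decodes it
    // incrementally, so characters split across reads stay intact.
    public static string ReadAll(Stream stream, Encoding encoding, int bufferSize = 8192)
    {
        Decoder decoder = encoding.GetDecoder(); // remembers partial characters between calls
        byte[] bytes = new byte[bufferSize];
        char[] chars = new char[encoding.GetMaxCharCount(bufferSize)];
        StringBuilder sb = new StringBuilder();

        int count;
        // Read returns 0 only at the end of the stream.
        while ((count = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, count, chars, 0);
            sb.Append(chars, 0, charCount);
        }
        return sb.ToString();
    }
}
```

With the response from the question, this could be called as ResponseText.ReadAll(response.GetResponseStream(), Encoding.UTF8).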
Strelok
  • This answer works, but it is missing code: the variable named 'count' is not defined. If you define the variable, set its value to buf.Length, and then add 'count--' in the while loop, it'll work. – bafsar Oct 20 '14 at 19:03
  • @bafsar It should better be done using response.ContentLength as follows Byte[] buf = new byte[response.ContentLength]; to get the correct buffer length – Redeemed1 Oct 20 '15 at 10:22
  • @Redeemed1 there isn't ContentLength set with Transfer-Encoding: chunked – George Chondrompilas Jan 24 '16 at 23:15
  • Looking at various answers I noticed that this one -- http://stackoverflow.com/questions/16998/reading-chunked-response-with-httpwebresponse/17236#17236 -- is nearly identical code. Is the example code here a copy/paste from that answer? If so, shouldn't there be an attribution for the source, even if it was slightly modified? – Solomon Rutzky Oct 10 '16 at 17:05
-1

If I've understood what you're asking, you can do it by reading line by line:

string htmlLine = reader.ReadLine();
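
A single ReadLine call returns only one line, so in practice it would sit in a loop that runs until ReadLine returns null. The following is a rough sketch of that idea; the class name LineReader and the method ReadByLine are hypothetical. Note that ReadLine strips line terminators, so the sketch appends '\n' after each line.

```csharp
using System;
using System.IO;
using System.Text;

public static class LineReader
{
    // Hypothetical helper: reads every line from the reader and joins
    // them again, terminating each line with '\n'.
    public static string ReadByLine(TextReader reader)
    {
        StringBuilder sb = new StringBuilder();
        string line;
        // ReadLine returns null once the end of the input is reached.
        while ((line = reader.ReadLine()) != null)
        {
            sb.Append(line).Append('\n');
        }
        return sb.ToString();
    }
}
```

With the reader from the question, this could be called as LineReader.ReadByLine(reader) inside the using block.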
Flexo
gsscoder