
I'm using WebClient and DownloadString to scrape a massive number of URLs. The purpose is to get the page source of each URL.

However, some URLs redirect to large file downloads and WebClient throws an error:

Out of memory

I do not want to download those large files. How can I set a maximum download size for WebClient?

  • http://stackoverflow.com/questions/2616358/limit-webclient-downloadfile-maximum-file-size – Blorgbeard Jan 06 '17 at 01:45
  • The question seems confused about whether you want to download the large files, or whether you want to set a size limit and skip any file that exceeds it. I came up with [OutOfMemoryException in WebClient](http://stackoverflow.com/a/15163532/3796048) – Mohit S Jan 06 '17 at 01:49

1 Answer


You can do that with WebClient, but it abstracts away so many details that it is more convenient to take control with the WebRequest classes directly (they are what WebClient uses under the hood anyway).
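For completeness, if you do want to stay with WebClient, a minimal sketch along these lines is possible (the 1 MB cap, the UTF-8 assumption and the variable names are just placeholders; it needs the System.IO, System.Net and System.Text namespaces). OpenRead hands you the response stream directly, so you decide how much of it to consume instead of letting DownloadString buffer the entire body:

using (var client = new WebClient())
using (var stream = client.OpenRead(url))
using (var ms = new MemoryStream())
{
    var buffer = new byte[8192];
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        ms.Write(buffer, 0, read);
        // stop once roughly 1 MB has been read
        if (ms.Length > 1024 * 1024)
            break;
    }
    var html = Encoding.UTF8.GetString(ms.ToArray());
}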

Instead, use the underlying HttpWebRequest and HttpWebResponse and stop reading the response stream once you reach your character limit.

Your method would be like this:

// requires: using System; using System.IO; using System.Net; using System.Text;
public static string DownloadAsString(string url)
{
    string pageSource = String.Empty;
    var req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    req.UserAgent = "MyCrawler/1.0";
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

    // dispose the response so the connection is released even for non-html urls
    using (var resp = (HttpWebResponse)req.GetResponse())
    {
        // is this even html and not an image or a video?
        if (resp.ContentType.Contains("text/html"))
        {
            var sb = new StringBuilder();
            var buffer = new char[8192];
            // get the stream
            using (var stream = resp.GetResponseStream())
            using (var sr = new StreamReader(stream, Encoding.UTF8))
            {
                // start copying in blocks of 8K characters
                var read = sr.ReadBlock(buffer, 0, buffer.Length);
                while (read > 0)
                {
                    // only append the characters actually read in this block
                    sb.Append(buffer, 0, read);
                    // max allowed chars per source
                    if (sb.Length > 50000)
                    {
                        sb.Append(" ... source truncated due to size");
                        // stop early
                        break;
                    }
                    read = sr.ReadBlock(buffer, 0, buffer.Length);
                }
                pageSource = sb.ToString();
            }
        }
    }
    return pageSource;
}

And you use this method like so:

var src = DownloadAsString("http://stackoverflow.com");
Console.WriteLine(src);

and this will output the HTML that makes up the front page of Stack Overflow. Notice that the output is capped at 50,000 chars, so you will see ... source truncated due to size at the end of the src string.
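If you also want to skip responses that advertise a huge body up front, you could additionally check the reported length right after GetResponse(), before reading anything. A small sketch (the 5 MB threshold is arbitrary, and many servers omit the Content-Length header, in which case ContentLength is -1, so the truncation logic above is still needed):

// inside DownloadAsString, right after GetResponse():
if (resp.ContentLength > 5 * 1024 * 1024)
{
    // the server says the body is larger than 5 MB; don't read it at all
    return pageSource;
}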

rene