You can do that with WebClient, but it abstracts away so many details that it is often more convenient to take direct control with the WebRequest classes (which WebClient uses under the hood anyway).
Use the underlying HttpWebRequest and HttpWebResponse and stop reading the response stream once you have reached the character limit.
Your method would then look like this (it assumes using directives for System, System.IO, System.Net, and System.Text):
public static string DownloadAsString(string url)
{
    string pageSource = String.Empty;

    var req = (HttpWebRequest)WebRequest.Create(url);
    req.Method = "GET";
    req.UserAgent = "MyCrawler/1.0";
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

    using (var resp = (HttpWebResponse)req.GetResponse())
    {
        // only read the body if it is HTML, not an image or video
        if (resp.ContentType.Contains("text/html"))
        {
            var sb = new StringBuilder();
            var buffer = new char[8192];

            // get the response stream
            using (var stream = resp.GetResponseStream())
            using (var sr = new StreamReader(stream, Encoding.UTF8))
            {
                // copy in blocks of 8K characters
                var read = sr.ReadBlock(buffer, 0, buffer.Length);
                while (read > 0)
                {
                    // only append the characters that were actually read
                    sb.Append(buffer, 0, read);

                    // max allowed chars per source
                    if (sb.Length > 50000)
                    {
                        sb.Append(" ... source truncated due to size");
                        // stop early
                        break;
                    }
                    read = sr.ReadBlock(buffer, 0, buffer.Length);
                }
                pageSource = sb.ToString();
            }
        }
    }
    return pageSource;
}
And you use this method like so:
var src = DownloadAsString("http://stackoverflow.com");
Console.WriteLine(src);
This will output the HTML that makes up the front page of Stack Overflow. Note that the output is capped at 50,000 chars, so it will show ... source truncated due to size at the end of the src string.
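Because the truncation marker is just text appended by the method above, you can check for it afterwards if you need to know whether a page was cut off. A minimal sketch (it only relies on the exact marker string used in DownloadAsString):

var src = DownloadAsString("http://stackoverflow.com");

// the marker is the suffix appended when the 50,000 char cap was hit
bool wasTruncated = src.EndsWith(" ... source truncated due to size");

Console.WriteLine(wasTruncated
    ? "Page was larger than the 50,000 character cap."
    : "Full page source downloaded.");

If you need this distinction often, it may be cleaner to have the method report truncation through an out parameter or a small result type instead of a magic string, but that is a design choice beyond the original question.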