4

I have many files that I have to download, so I'm trying to use the power of the new async features, as below.

var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();

var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
}

What I'm afraid of is that this code will cause big memory usage: if there are 1000 files of 2 MB each, will it load 1000 × 2 MB of streams into memory?

I may be missing something, or I may be totally right. If I'm not missing anything, is it better to await each request and consume its stream one at a time?

Freshblood
  • But what if it is 1 file of 2GB or 4 x 500MB etc etc? – Paul Zahra May 27 '14 at 14:23
    @PaulZahra Since the contents of the files are streamed, rather than eagerly loaded into memory, that may not be a problem, depending on the implementation of `GetResponseStream`. Getting the response stream doesn't necessarily mean loading the entire response, although it *could*. – Servy May 27 '14 at 14:25
  • @Servy So this code will work quite efficiently if the streams aren't loaded until they are used, right? – Freshblood May 27 '14 at 14:28
    @Freshblood by the way... Although you use asynchronous method, it can block the main thread for a while. It's because before the async download itself, it checks the DNS name and this check is done internally by a blocking function. If you use IP instead of domain name, the async download will be fully asynchronous. – Paul Zahra May 27 '14 at 14:30
  • @Freshblood I think what he's saying is write the stream as you read it; don't buffer too much... effectively it's the buffer that will eat the lion's share of your memory. – Paul Zahra May 27 '14 at 14:34
  • @PaulZahra What do you think about the chunked implementation in my answer http://stackoverflow.com/a/23893276/325661? – Freshblood May 27 '14 at 15:25
  • @Freshblood It looks like a reasonable implementation (perfection is very difficult)... but consider setting the buffer size of CopyToAsync: bufferSize is the size, in bytes, of the buffer, must be greater than zero, and defaults to 81920. Basically, with 5 × 81920 you will be writing about 0.39 MB to disk. – Paul Zahra May 28 '14 at 08:07

3 Answers

5

Both options could be problematic. Downloading only one file at a time doesn't scale and takes time, while downloading all files at once could be too much of a load (and there's no need to wait for all of them to download before you process them).

I prefer to always cap such operations with a configurable size. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim); a more robust way is to use TPL Dataflow with MaxDegreeOfParallelism.
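A minimal sketch of the SemaphoreSlim approach, with a hypothetical cap of 4 and a `Task.Delay` standing in for the real download (the `RunAsync` helper and the concurrency counter exist only to demonstrate the cap):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottleSketch
{
    // Runs `taskCount` fake downloads with at most `cap` in flight at once,
    // and returns the peak concurrency actually observed.
    public static async Task<int> RunAsync(int taskCount, int cap)
    {
        var gate = new SemaphoreSlim(cap);
        var sync = new object();
        int current = 0, peak = 0;

        async Task ProcessAsync(string url)
        {
            await gate.WaitAsync(); // asynchronously parks once `cap` tasks are in flight
            try
            {
                lock (sync) { current++; peak = Math.Max(peak, current); }
                await Task.Delay(20); // stands in for GetResponseAsync + CopyToAsync
                lock (sync) { current--; }
            }
            finally
            {
                gate.Release();
            }
        }

        var urls = Enumerable.Range(0, taskCount).Select(i => "http://example.com/" + i);
        await Task.WhenAll(urls.Select(ProcessAsync));
        return peak;
    }

    static async Task Main()
    {
        // Peak concurrency never exceeds the cap of 4.
        Console.WriteLine(await RunAsync(20, 4));
    }
}
```

Because `WaitAsync` suspends rather than blocks, memory stays bounded by the number of concurrent streams instead of the total number of files.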

var block = new ActionBlock<string>(async url =>
    {
        var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

foreach (var url in urls)
{
    block.Post(url);
}
block.Complete();
await block.Completion;
i3arnon
  • There is no ActionBlock class; I can't import it. Maybe Parallel.For is the right thing? – Freshblood May 27 '14 at 14:26
  • @Freshblood "The TPL Dataflow Library is not distributed with the .NET Framework. To install it open your project in Visual Studio, choose Manage NuGet Packages from the Project menu, and search online for the Microsoft.Tpl.Dataflow package." – i3arnon May 27 '14 at 14:27
  • @Freshblood You can do similar things without that library, but you really should use it. – i3arnon May 27 '14 at 14:28
  • It looks like a much simpler chunking approach. Look at my answer. – Freshblood May 27 '14 at 15:25
3

Your code will load the streams into memory whether you use async or not. Async handles the I/O part by returning control to the caller until the response stream arrives.

The choice you have to make doesn't concern async, but rather how your program should read a big stream input.

If I were you, I would think about how to split the workload into chunks. You might read the response streams in parallel and save each one to a different destination (perhaps a file), then release it from memory.

Yuval Itzchakov
2

This is my own answer, implementing the chunking idea from Yuval Itzchakov's answer. Please provide feedback on this implementation.

foreach (var chunk in urls.Batch(5))
{
    var streamTasks = chunk
        .Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream());

    var streams = await Task.WhenAll(streamTasks);

    foreach (var stream in streams)
    {
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
}

Batch is an extension method, implemented simply as below.

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
    while (source.Any())
    {
        yield return source.Take(chunksize);
        source = source.Skip(chunksize);
    }
}
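One caveat: `Any`, `Take`, and `Skip` each re-enumerate `source`, so the Batch above walks the sequence again for every chunk (and re-executes it if it is lazy). A single-pass variant, offered here as a sketch rather than as part of the original answer, could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BatchExtensions
{
    // Single-pass variant: enumerates the source exactly once, unlike the
    // Take/Skip version, which re-enumerates it for every chunk.
    public static IEnumerable<IReadOnlyList<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
    {
        if (chunksize <= 0) throw new ArgumentOutOfRangeException(nameof(chunksize));
        var bucket = new List<T>(chunksize);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == chunksize)
            {
                yield return bucket;              // emit a full chunk
                bucket = new List<T>(chunksize);  // start a fresh one
            }
        }
        if (bucket.Count > 0)
            yield return bucket; // emit the final, possibly short, chunk
    }
}
```

For example, `Enumerable.Range(1, 10).Batch(3)` yields chunks of sizes 3, 3, 3, and 1, while enumerating the source exactly once.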
i3arnon
Freshblood