4

I have many files that I have to download, so I'm trying to use the power of the new async features, as below.

var streamTasks = urls.Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream()).ToList();

var streams = await Task.WhenAll(streamTasks);
foreach (var stream in streams)
{
    using (var fileStream = new FileStream("blabla", FileMode.Create))
    {
        await stream.CopyToAsync(fileStream);
    }
}

What I'm afraid of is that this code will cause big memory usage: if there are 1000 files of 2 MB each, will it load 1000 × 2 MB of streams into memory?

I may be missing something, or I may be totally right. If I'm not missing anything, is it better to await each request and consume its stream one at a time?

Freshblood
  • But what if it is 1 file of 2GB or 4 x 500MB etc etc? – Paul Zahra May 27 '14 at 14:23
    @PaulZahra Since the contents of the files are streamed, rather than eagerly loaded into memory, that may not be a problem, depending on the implementation of `GetResponseStream`. Getting the response stream doesn't necessarily mean loading the entire response, although it *could*. – Servy May 27 '14 at 14:25
  • @Servy So this code will work quite efficiently if the streams aren't loaded until they are used, right? – Freshblood May 27 '14 at 14:28
    @Freshblood by the way... Although you use asynchronous method, it can block the main thread for a while. It's because before the async download itself, it checks the DNS name and this check is done internally by a blocking function. If you use IP instead of domain name, the async download will be fully asynchronous. – Paul Zahra May 27 '14 at 14:30
  • @Freshblood I think what he's saying is write the stream as you read it; don't buffer too much... effectively it's the buffer that will eat the lion's share of your memory. – Paul Zahra May 27 '14 at 14:34
  • @PaulZahra What do you think about the chunked implementation in my answer http://stackoverflow.com/a/23893276/325661? – Freshblood May 27 '14 at 15:25
  • @Freshblood It looks like a reasonable implementation (perfection is very difficult)... but consider setting the buffer size of CopyToAsync: bufferSize is the size, in bytes, of the buffer, must be greater than zero, and defaults to 81920. Basically, with 5 × 81920 you will be writing about 0.39 MB to disk. – Paul Zahra May 28 '14 at 08:07

3 Answers

5

Both options could be problematic. Downloading only one file at a time doesn't scale and takes time, while downloading all files at once could be too much of a load (and there's no need to wait for all of them to download before you process them).

I prefer to always cap such operations with a configurable size. A simple way to do so is to use an AsyncLock (which utilizes SemaphoreSlim); a more robust way is to use TPL Dataflow with MaxDegreeOfParallelism.
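A minimal sketch of the SemaphoreSlim approach, with a hypothetical cap of 4 and a `Task.Delay` standing in for the real download (the `RunAsync` helper and the concurrency counter exist only to demonstrate the cap):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ThrottleSketch
{
    // Runs `taskCount` fake downloads with at most `cap` in flight at once,
    // and returns the peak concurrency actually observed.
    public static async Task<int> RunAsync(int taskCount, int cap)
    {
        var gate = new SemaphoreSlim(cap);
        var sync = new object();
        int current = 0, peak = 0;

        async Task ProcessAsync(string url)
        {
            await gate.WaitAsync(); // asynchronously parks once `cap` tasks are in flight
            try
            {
                lock (sync) { current++; peak = Math.Max(peak, current); }
                await Task.Delay(20); // stands in for GetResponseAsync + CopyToAsync
                lock (sync) { current--; }
            }
            finally
            {
                gate.Release();
            }
        }

        var urls = Enumerable.Range(0, taskCount).Select(i => "http://example.com/" + i);
        await Task.WhenAll(urls.Select(ProcessAsync));
        return peak;
    }

    static async Task Main()
    {
        // Peak concurrency never exceeds the cap of 4.
        Console.WriteLine(await RunAsync(20, 4));
    }
}
```

Because `WaitAsync` suspends rather than blocks, memory stays bounded by the number of concurrent streams instead of the total number of files.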

var block = new ActionBlock<string>(async url =>
    {
        var stream = (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream();
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    },
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 100 });

foreach (var url in urls)
{
    block.Post(url);
}
block.Complete();
await block.Completion;
i3arnon
  • There is no ActionBlock class; I can't import it. Maybe Parallel.For is the right thing? – Freshblood May 27 '14 at 14:26
  • @Freshblood "The TPL Dataflow Library is not distributed with the .NET Framework. To install it open your project in Visual Studio, choose Manage NuGet Packages from the Project menu, and search online for the Microsoft.Tpl.Dataflow package." – i3arnon May 27 '14 at 14:27
  • @Freshblood You can do similar things without that library, but you really should use it. – i3arnon May 27 '14 at 14:28
  • It looks like a much simpler chunking approach. Look at my answer. – Freshblood May 27 '14 at 15:25
3

Your code will load the streams into memory whether you use async or not. Async handles the I/O part by returning control to the caller until the response stream arrives.

The choice you have to make doesn't concern async, but rather how your program should read a big stream input.

If I were you, I would think about how to split the workload into chunks. You might read the response streams in parallel and save each one to a different destination (perhaps a file), then release it from memory.

Yuval Itzchakov
2

This is my own answer, implementing the chunking idea from Yuval Itzchakov's answer. Please provide feedback on this implementation.

foreach (var chunk in urls.Batch(5))
{
    var streamTasks = chunk
        .Select(async url => (await WebRequest.CreateHttp(url).GetResponseAsync()).GetResponseStream());

    var streams = await Task.WhenAll(streamTasks);

    foreach (var stream in streams)
    {
        using (var fileStream = new FileStream("blabla", FileMode.Create))
        {
            await stream.CopyToAsync(fileStream);
        }
    }
}

Batch is an extension method, implemented simply as below.

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
{
    while (source.Any())
    {
        yield return source.Take(chunksize);
        source = source.Skip(chunksize);
    }
}
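One caveat: `Any`, `Take`, and `Skip` each re-enumerate `source`, so the Batch above walks the sequence again for every chunk (and re-executes it if it is lazy). A single-pass variant, offered here as a sketch rather than as part of the original answer, could look like this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BatchExtensions
{
    // Single-pass variant: enumerates the source exactly once, unlike the
    // Take/Skip version, which re-enumerates it for every chunk.
    public static IEnumerable<IReadOnlyList<T>> Batch<T>(this IEnumerable<T> source, int chunksize)
    {
        if (chunksize <= 0) throw new ArgumentOutOfRangeException(nameof(chunksize));
        var bucket = new List<T>(chunksize);
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == chunksize)
            {
                yield return bucket;              // emit a full chunk
                bucket = new List<T>(chunksize);  // start a fresh one
            }
        }
        if (bucket.Count > 0)
            yield return bucket; // emit the final, possibly short, chunk
    }
}
```

For example, `Enumerable.Range(1, 10).Batch(3)` yields chunks of sizes 3, 3, 3, and 1, while enumerating the source exactly once.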
i3arnon
Freshblood