How to asynchronically download millions of files from a file storage?

Question

Let's assume I have a database managing millions of documents, which are stored on a WebDAV or SMB server that does not support retrieving documents in bulk. Given a list of (potentially all) document IDs, how do I download the corresponding documents as fast as possible?

Iterating over the list and downloading the documents sequentially is far too slow. The two options I see are threads and async downloads.

My gut says that async programming should be preferred to threads, because I'm just waiting for IO on the client side. But I am rather new to async programming and I don't know how to do it. I assume that iterating over the whole list and sending an async download request could potentially lead to too many requests in a very short time leading to rejected requests. So how do I throttle this? Is there a best practice way to do this?

That's great! The semaphore solution seems to do the job, but I wasn't aware of the TPL dataflow library, which seems very interesting! Actually, if you put your comments into an answer, I would accept this! — Ben, Apr 08 '20 at 22:19
Just something to keep in mind: you're asking about "asynchronically" when you really mean "concurrently". They are two separate concepts. You require threads for this regardless of whether you use `async` or not. — Enigmativity, Apr 09 '20 at 03:37
I am aware of that. But I know how to download files in multiple threads running in parallel. What I didn't know was how to deal with the problem of too many requests that would only arise when using async programming. — Ben, Apr 09 '20 at 13:54

score 1 · Accepted Answer · answered Apr 08 '20 at 22:32

Take a look at this question: "How to limit the amount of concurrent async I/O?" Using a `SemaphoreSlim`, as suggested in the accepted answer there, is an easy and quite good solution.
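A minimal sketch of the semaphore approach, not taken from the linked answer. The names `ThrottledDownloader`, `ForEachThrottledAsync`, `maxConcurrency` and `downloadAsync` are hypothetical; `downloadAsync` stands in for whatever WebDAV or SMB call fetches one document.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloader
{
    // Runs downloadAsync for every id, with at most maxConcurrency calls in flight.
    // downloadAsync is a placeholder for the real per-document download.
    public static async Task ForEachThrottledAsync<T>(
        IEnumerable<T> ids, int maxConcurrency, Func<T, Task> downloadAsync)
    {
        using var semaphore = new SemaphoreSlim(maxConcurrency);
        var tasks = new List<Task>();

        foreach (var id in ids)
        {
            // Waits (without blocking a thread) until one of the slots is free.
            await semaphore.WaitAsync();
            tasks.Add(RunAsync(id));
        }

        // Rethrows the first failure, if any, after all started downloads have finished.
        await Task.WhenAll(tasks);

        async Task RunAsync(T id)
        {
            try { await downloadAsync(id); }
            finally { semaphore.Release(); } // free the slot even if the download failed
        }
    }
}
```

Because the loop awaits the semaphore before starting each download, a failed download does not stop the remaining ones; the error only surfaces at `Task.WhenAll`. The `tasks` list still grows by one entry per document, which may matter for millions of IDs.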

My personal favorite for this kind of job, though, is the TPL Dataflow library. You can see here an example of using this library to download pages from the web asynchronously with a configurable level of concurrency, in combination with the `HttpClient` class. Here is another example.
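A rough sketch of the Dataflow variant, assuming an `ActionBlock<T>` is enough for the job. It is not the code from the linked examples; `DataflowDownloader`, `DownloadAllAsync`, `maxConcurrency` and `downloadAsync` are hypothetical names, and the bounded capacity of ten items per worker is an arbitrary choice. Depending on the target framework, the `System.Threading.Tasks.Dataflow` NuGet package may have to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowDownloader
{
    // Feeds all ids into an ActionBlock that runs at most maxConcurrency downloads at once.
    // downloadAsync is a placeholder for the real per-document download.
    public static async Task DownloadAllAsync<T>(
        IEnumerable<T> ids, int maxConcurrency, Func<T, Task> downloadAsync)
    {
        var block = new ActionBlock<T>(downloadAsync, new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = maxConcurrency,
            // Keeps the internal buffer small so millions of ids are not queued up front.
            BoundedCapacity = maxConcurrency * 10,
        });

        foreach (var id in ids)
        {
            // SendAsync waits while the buffer is full and returns false
            // once the block no longer accepts items (for example after a fault).
            if (!await block.SendAsync(id)) break;
        }

        block.Complete();
        await block.Completion; // rethrows if a download faulted the block
    }
}
```

With the default options, an unhandled exception in `downloadAsync` faults the block and the remaining items are not processed, so a delegate that should survive individual failures would have to catch them itself.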

score 0 · Answer 2 · answered Apr 09 '20 at 13:59

I also found this great article explaining 4 different ways to limit the number of concurrent downloads.

answered by Ben

The suggestions of [that article](https://markheath.net/post/constraining-concurrent-threads-csharp) are not of top-notch quality. For example, *Technique 1 - ConcurrentQueue* suffers from the problem of diminishing concurrency in case of exceptions. If a web request fails, the worker that made the request will die, but the remaining workers will continue pumping the queue. The level of concurrency will be decreased by one. The error will eventually be surfaced at the line `await Task.WhenAll(tasks);` when the queue becomes empty or all the workers have died, whichever happens first. – Theodor Zoulias Apr 09 '20 at 16:56
Well... you could easily handle exceptions in the worker. I think that was omitted for simplicity. It is more about the concept. – Ben Apr 09 '20 at 18:33
Yes, but why bother with an incomplete concept when there is a ready-to-use tool available? With TPL Dataflow you get a production-ready tool with 9 years of history that offers all kinds of options (cancellation, bounded capacity, ordering of the results, etc.). It can also be used for more complex scenarios, like linking two or more blocks together to form a pipeline, with a different concurrency level for each block. There is a lot of power there! – Theodor Zoulias Apr 09 '20 at 19:03
Btw, all Dataflow blocks have an internal buffer where the items to be processed are stored, a buffer similar or identical to a `ConcurrentQueue`. Most of them also have internal workers for processing these items. You can control how many items will be processed by each worker before it is recycled (the rarely used [`MaxMessagesPerTask`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.dataflow.dataflowblockoptions.maxmessagespertask) property). In short, you could say that TPL Dataflow is a polished implementation of the *Technique 1 - ConcurrentQueue* concept. – Theodor Zoulias Apr 09 '20 at 19:37
No doubt the TPL Dataflow library is a great tool! But it is always interesting to see different options for doing something, together with their pros and cons. – Ben Apr 10 '20 at 10:43
I agree. Learning about the various throttling mechanisms can be fascinating by itself, like studying the various sorting algorithms. :-) – Theodor Zoulias Apr 10 '20 at 11:52
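
For illustration, the "handle exceptions in the worker" idea from the comments above could look roughly like this. It is a sketch and not the article's code; `QueueWorkers`, `DrainAsync`, `workerCount` and `downloadAsync` are hypothetical names. Each worker catches per-item failures, so a failed download is recorded instead of killing the worker and reducing the level of concurrency.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class QueueWorkers
{
    // Starts workerCount workers that drain a shared queue of ids.
    // Returns the ids whose download failed, together with the exception.
    public static async Task<IReadOnlyCollection<(T Id, Exception Error)>> DrainAsync<T>(
        IEnumerable<T> ids, int workerCount, Func<T, Task> downloadAsync)
    {
        var queue = new ConcurrentQueue<T>(ids);
        var failures = new ConcurrentBag<(T Id, Exception Error)>();

        async Task WorkerAsync()
        {
            while (queue.TryDequeue(out var id))
            {
                try { await downloadAsync(id); }
                // Record the failure and keep going, so the worker stays alive.
                catch (Exception ex) { failures.Add((id, ex)); }
            }
        }

        var workers = Enumerable.Range(0, workerCount).Select(_ => WorkerAsync()).ToArray();
        await Task.WhenAll(workers);
        return failures;
    }
}
```

The caller can then decide whether to retry or log the returned failures. Note that the queue is filled with all IDs up front, which is one of the things a bounded Dataflow block avoids.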
