I'm downloading 100K+ files and want to do it in batches, for example 100 files at a time.

static void Main(string[] args) {
    Task.WaitAll(
      new Task[]{
           RunAsync()
    });
}

// each group has 100 attachments.
static async Task RunAsync() {
    foreach (var group in groups) {
        var tasks = new List<Task>();
        foreach (var attachment in group.attachments) {
            tasks.Add(DownloadFileAsync(attachment, downloadPath));
        }
        await Task.WhenAll(tasks);
    }
}

static async Task DownloadFileAsync(Attachment attachment, string path) {
    using (var client = new HttpClient()) {
        using (var fileStream = File.Create(path + attachment.FileName)) {
            var downloadedFileStream = await client.GetStreamAsync(attachment.url);
            await downloadedFileStream.CopyToAsync(fileStream);
        }
    }
}

Expected: it downloads 100 files at a time, then the next 100.

Actual: it downloads many more than that at the same time, and quickly fails with the error: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

Quentin
  • That's a shame it got marked as a duplicate, as the other question uses significantly different methods and I'd be glad to learn why the one used by Quentin fails. – Bartosz Aug 04 '17 at 21:44
  • I agree; not a duplicate. My guess would be that the HttpClient methods return earlier than you hope. – BradleyDotNET Aug 04 '17 at 22:02
  • I highly recommend reading [Is async HttpClient from .Net 4.5 a bad choice for intensive load applications?](https://stackoverflow.com/questions/16194054/is-async-httpclient-from-net-4-5-a-bad-choice-for-intensive-load-applications). – Erik Philips Aug 04 '17 at 22:07
  • I had a similar issue while consuming a web service in .NET Core. Put the tasks into a queue and, as soon as a task finishes, dequeue and run the next task from the queue. You should synchronize access to the queue, of course. That should work. – Mert Akcakaya Aug 04 '17 at 22:15
  • Thank you @MertAkcakaya Will try the queue method. – Quentin Aug 04 '17 at 22:19
  • @Quentin Some advice: create one `HttpClient` and reuse it. It is meant to be instantiated once and reused; constantly creating and disposing it can have adverse effects. You should also create the file only if the download succeeded. – Nkosi Aug 04 '17 at 22:47
  • @Nkosi Read it just now and advice taken. thank you. – Quentin Aug 04 '17 at 22:54

1 Answer

Running the tasks in fixed "batches" is not a good idea in terms of performance: a single long-running task holds up the whole batch, because the next batch cannot start until it finishes. A better approach is to start a new task as soon as one finishes.
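
To put rough numbers on that, here is a small sketch that only does arithmetic (no real downloads). The class name `BatchVsSlidingWindow`, both method names, and the sample durations are made up for illustration. It compares the total time of fixed batches with the total time when a free slot is refilled immediately.

```csharp
using System;
using System.Linq;

public static class BatchVsSlidingWindow
{
    // Fixed batches: each batch takes as long as its slowest task,
    // and the next batch starts only when the previous one is done.
    public static int BatchedTotal(int[] durations, int size)
    {
        int total = 0;
        for (int i = 0; i < durations.Length; i += size)
            total += durations.Skip(i).Take(size).Max();
        return total;
    }

    // Sliding window: each task goes to whichever slot frees up first.
    public static int SlidingTotal(int[] durations, int size)
    {
        var slotFreeAt = new int[size];
        foreach (int d in durations)
        {
            int earliest = Array.IndexOf(slotFreeAt, slotFreeAt.Min());
            slotFreeAt[earliest] += d;
        }
        return slotFreeAt.Max();
    }
}
```

With hypothetical durations of 300, 10, 300 and 10 ms and two tasks at a time, `BatchedTotal` gives 600 while `SlidingTotal` gives 310, because the short tasks no longer wait for the long ones.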

This can be implemented with a queue, as @MertAkcakaya suggested, but I will post an alternative based on my other answer, "Have a set of Tasks with only X running at a time":

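using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

int maxThreads = 3;
// Set this once in your app, to the maximum number of concurrent connections you want to allow.
System.Net.ServicePointManager.DefaultConnectionLimit = 50;

// (url, target file name) pairs to download
var urls = new Tuple<string, string>[] {
    Tuple.Create("http://cnn.com","temp/cnn1.htm"),
    Tuple.Create("http://cnn.com","temp/cnn2.htm"),
    Tuple.Create("http://bbc.com","temp/bbc1.htm"),
    Tuple.Create("http://bbc.com","temp/bbc2.htm"),
    Tuple.Create("http://stackoverflow.com","temp/stackoverflow.htm"),
    Tuple.Create("http://google.com","temp/google1.htm"),
    Tuple.Create("http://google.com","temp/google2.htm"),
};
// Fire and forget: this returns once the last download has been started.
DownloadParallel(urls, maxThreads);

async Task DownloadParallel(IEnumerable<Tuple<string,string>> urls, int maxThreads)
{
    // The semaphore holds one slot per allowed concurrent download.
    SemaphoreSlim maxThread = new SemaphoreSlim(maxThreads);
    // One shared HttpClient for all downloads.
    var client = new HttpClient();

    foreach(var url in urls)
    {
        // Wait until a slot is free, then start the next download without awaiting it.
        await maxThread.WaitAsync();
        // The continuation frees the slot whether the download succeeded or failed.
        _ = DownloadFile(client, url.Item1, url.Item2)
                    .ContinueWith((task) => maxThread.Release() );
    }
}


async Task DownloadFile(HttpClient client, string url, string fileName)
{
    // Dispose both the response stream and the file stream when done.
    using (var stream = await client.GetStreamAsync(url))
    using (var fileStream = File.Create(fileName))
    {
        await stream.CopyToAsync(fileStream);
    }
}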

PS: `DownloadParallel` returns as soon as it has started the last download, not when all downloads have finished, so awaiting it does not tell you that everything is done. If you really want to await completion, add `for (int i = 0; i < maxThreads; i++) await maxThread.WaitAsync();` at the end of the method.
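
If you do want something you can await, one possible variant is sketched below. It is not the code from my other answer: the class `Throttle` and the method `ForEachAsync` are names invented for this example. It keeps every started task in a list and finishes with `Task.WhenAll`, so the returned task completes only after the last item is done.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class Throttle
{
    // Runs 'action' for every item with at most 'maxConcurrency' in flight.
    // The returned task completes only after every started task has finished.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> action)
    {
        var gate = new SemaphoreSlim(maxConcurrency);
        var running = new List<Task>();

        foreach (var item in items)
        {
            await gate.WaitAsync();                  // wait for a free slot
            running.Add(RunOneAsync(item, action, gate));
        }

        await Task.WhenAll(running);                 // wait for the tasks still in flight
    }

    private static async Task RunOneAsync<T>(T item, Func<T, Task> action, SemaphoreSlim gate)
    {
        try
        {
            await action(item);
        }
        finally
        {
            gate.Release();                          // free the slot even if the action failed
        }
    }
}
```

Assuming the `DownloadFile` method and `urls` array from above, usage would look like `await Throttle.ForEachAsync(urls, 3, u => DownloadFile(client, u.Item1, u.Item2));`. Note that `Task.WhenAll` rethrows a failure from any item once all of them have finished, so per-item error handling still belongs inside the action.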

PS2: Don't forget to add exception handling to DownloadFile
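
As a rough idea of what that could look like (one option among many, not a definitive implementation): the sketch below writes to a temporary file and renames it only after the download succeeded, which also follows @Nkosi's advice from the comments. The names `SafeDownload` and `TryDownloadFileAsync` and the `.part` suffix are assumptions made for this example.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public static class SafeDownload
{
    // Returns true if the file was downloaded, false if the request or the write failed.
    public static async Task<bool> TryDownloadFileAsync(HttpClient client, string url, string fileName)
    {
        string tempName = fileName + ".part";   // hypothetical naming convention
        try
        {
            using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            {
                response.EnsureSuccessStatusCode();   // throws HttpRequestException on 4xx/5xx
                using (var source = await response.Content.ReadAsStreamAsync())
                using (var target = File.Create(tempName))
                {
                    await source.CopyToAsync(target);
                }
            }

            // Only now does the final file name appear on disk.
            if (File.Exists(fileName)) File.Delete(fileName);
            File.Move(tempName, fileName);
            return true;
        }
        catch (Exception ex) when (ex is HttpRequestException
                                || ex is IOException
                                || ex is TaskCanceledException)   // TaskCanceledException covers timeouts
        {
            Console.Error.WriteLine($"Download failed for {url}: {ex.Message}");
            if (File.Exists(tempName)) File.Delete(tempName);      // remove the partial file
            return false;
        }
    }
}
```

Returning a `bool` instead of throwing means one failed download will not fault the whole run; whether you would rather retry, collect the failures, or abort is up to your application.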

L.B