3

My aim is to download images from an Amazon Web Services bucket.

I have the following code function which downloads multiple images at once:

public static void DownloadFilesFromAWS(string bucketName, List<string> imageNames)
{
    int batchSize = 50;
    int maxDownloadMilliseconds = 10000;

    List<Task> tasks = new List<Task>();

    for (int i = 0; i < imageNames.Count; i++)
    {
        string imageName = imageNames[i];
        Task task = Task.Run(() => GetFile(bucketName, imageName));
        tasks.Add(task);
        if (tasks.Count > 0 && tasks.Count % batchSize == 0)
        {
            Task.WaitAll(tasks.ToArray(), maxDownloadMilliseconds);//wait to download
            tasks.Clear();
        }
    }

    //if there are any left, wait for them
    Task.WaitAll(tasks.ToArray(), maxDownloadMilliseconds);
}

private static void GetFile(string bucketName, string filename)
{
    try
    {
        using (AmazonS3Client awsClient = new AmazonS3Client(Amazon.RegionEndpoint.EUWest1))
        {
            string key = Path.GetFileName(filename);

            GetObjectRequest getObjectRequest = new GetObjectRequest() {
                 BucketName = bucketName,
                    Key = key
            };

            using (GetObjectResponse response = awsClient.GetObject(getObjectRequest))
            {
                string directory = Path.GetDirectoryName(filename);
                if (!Directory.Exists(directory))
                {
                    Directory.CreateDirectory(directory);
                }

                if (!File.Exists(filename))
                {
                    response.WriteResponseStreamToFile(filename);
                }
            }
        }
    }
    catch (AmazonS3Exception amazonS3Exception)
    {
        if (amazonS3Exception.ErrorCode == "NoSuchKey")
        {
            return;
        }
        if (amazonS3Exception.ErrorCode != null && (amazonS3Exception.ErrorCode.Equals("InvalidAccessKeyId") || amazonS3Exception.ErrorCode.Equals("InvalidSecurity")))
        {
            // Log AWS invalid credentials
            throw new ApplicationException("AWS Invalid Credentials");
        }
        else
        {
            // Log generic AWS exception
            throw new ApplicationException("AWS Exception: " + amazonS3Exception.Message);
        }
    }
    catch
    {
        //
    }
}

The downloading of the images all works fine but the Task.WaitAll seems to be ignored and the rest of the code continues to be executed - meaning I try to get files that are currently non existent (as they've not yet been downloaded).

I found this answer to another question which seems to be the same as mine. I tried to use the answer to change my code but it still wouldn't wait for all files to be downloaded.

Can anyone tell me where I am going wrong?

Community
  • 1
  • 1
Percy
  • 2,855
  • 2
  • 33
  • 56
  • Should you not first add all download tasks and only **afterwards** wait for them to complete ? This is how I do it. The way you are doing - waiting right away after placing a new thread in the pool for it to complete - looks like synchronous programming - in other words you are not taking any benefit from multithreading. This does not directly relate to *not waiting* but... am leaving as a remark. – Veverke May 16 '16 at 09:53
  • @Veverke I'm new to Tasks so I thought if I added them all, suppose there were 10,000 files I needed to download, would all tasks be running at the same time and be too much for the system to handle??? - For this reason I decides to run the tasks in batches. – Percy May 16 '16 at 09:55
  • Well I missed the fact that you have 2 waits, I saw the first only. I then now ask something else: why do you need the one inside the loop ? Did you try running once without it ? Try commenting the if inside the for loop and see what happens. Do you get exceptions ? – Veverke May 16 '16 at 09:57
  • Again, as I am used to do: I add all my download tasks into a list - and afterwards I have a wait all on the list (array). In case there is a specific reason why you have that wait inside the for loop, please point out. Otherwise, do as suggested in my previous comment and let me know what happens. – Veverke May 16 '16 at 09:58
  • 1
    Relating to your comment - there is an amount of threads the operating system can handle in an effective way, and everything beyond that will not only not provide benefits but will decrease performance due task handling overheads (context switches, etc) - but I do not know what this number is. It also depends on how much time your tasks take to complete/process time estimate. If they are supposed to complete their job in 5 seconds, you should be able to handle much more of them concurrently rather than if each must take 5 minutes. – Veverke May 16 '16 at 10:03
  • The same issue exists when I do this. All my looping is doing is adding the tasks in batches of 50 then running them, then adding 50 more and running them - for the reasons I said above, to not impact system resources too much - I think the issue is because the task is finished but the actual download isn't - like in the link I've provided but I'm not sure how to resolve... – Percy May 16 '16 at 10:04
  • Is `WriteResponseStreamToFile` synchronous? – MicroVirus May 16 '16 at 10:04
  • @MicroVirus no it's not - there is a call to `awsClient.GetObjectAsync` which I tried to use in my solution to the problem but I still had the same error. – Percy May 16 '16 at 10:07
  • I do not see how you are breaking the list of image names into batches since I see no reference to `batchSize` within the for except inside the if statement. You are adding as many threads as List elements. – Veverke May 16 '16 at 10:07
  • 1
    The batching might actually be a good thing, because the thing you'll most likely be expending first is not the amount of tasks that can be run at the same time, but rather the number of connections to the web service and the bandwidth. – MicroVirus May 16 '16 at 10:07
  • Does Task.WaitAll() return immediately or does it return after the 10 seconds timeout that are specified and the files haven't finished downloading by then? – NineBerry May 16 '16 at 10:09
  • @Veverke I set the batchsize on this line `int batchSize = 50;` then the line `if (tasks.Count > 0 && tasks.Count % batchSize == 0)` determins when there are 50 Tasks ready to run – Percy May 16 '16 at 10:10
  • @NineBerry in testing I have 70 images to download so the `WaitAll` gets hit twice - total time to run is 20 seconds. The code then continues and I hit my `if(!File.Exists(....` code - even though I shouldn't - as it is hit for an image I've requested download of. I continue to run and the image eventually appears in the directory. – Percy May 16 '16 at 10:14
  • @Veverke - no, only 22 are downloaded before the wait resumes. – Percy May 16 '16 at 10:21

1 Answers1

6

The code behaves as expected. Task.WaitAll returns after ten seconds even when not all files have been downloaded, because you have specified a timeout of 10 seconds (10000 milliseconds) in variable maxDownloadMilliseconds.

If you really want to wait for all downloads to finish, call Task.WaitAll without specifying a timeout.

Use

Task.WaitAll(tasks.ToArray());//wait to download

at both places.

To see some good explanations on how to implement parallel downloads while not stressing the system (only have a maximum number of parallel downloads), see the answer at How can I limit Parallel.ForEach?

Community
  • 1
  • 1
NineBerry
  • 26,306
  • 3
  • 62
  • 93
  • Hmmmm... I assumed after 10 seconds, the unfinished Tasks would be aborted... Let me test... – Percy May 16 '16 at 10:23
  • 1
    The unfinished tasks are not aborted, the WaitAll just returns while the tasks continue running in the background. – NineBerry May 16 '16 at 10:24
  • 1
    If you want the tasks getting aborted you have to bring a CancellationToken within this game :o) – Sir Rufo May 16 '16 at 10:27
  • 1
    @NineBerry It wont abort them. It just returns false if the tasks are still executing and you can choose to continue waiting or to cancel them manually. – CathalMF May 16 '16 at 10:27
  • DAAAAAAAAAAAAAAAAAM!!! totally correct! Thank you for this... Still learning about Tasks so missed this. – Percy May 16 '16 at 10:29
  • If something errors whilst downloading a file, what will happen to the task? will it be considered finished or will it wait forever? - In other words, should I include a longer wait time, e.g. 60 seconds so I don't end up waiting forever? – Percy May 16 '16 at 10:30
  • 1
    The final Status of a task is RanToCompletion, Canceled or Faulted – Sir Rufo May 16 '16 at 10:33
  • 1
    When an exception occurs in the task, the task function finishes, so the task is considered finished. – NineBerry May 16 '16 at 10:34
  • 1
    Added a link to some good answer for the scenario you are dealing with – NineBerry May 16 '16 at 10:46
  • @NineBerry just been looking at the link - thanks. As a test I changed my batch from 50 to 10 and ran it - it took 264 seconds, I then used `Parallel.ForEach(imageNames, imageName => GetFile(bucketName, imageName));` and it took 478 seconds! – Percy May 16 '16 at 11:56
  • @Rick Of course it can be slower - less stress means more time for the tasks. Another option can be [pipelining](https://en.m.wikipedia.org/wiki/Instruction_pipelining) with a downloading and storing stage. But remeber that an elegant solution is sometimes a bit slower than a brute one – Sir Rufo May 16 '16 at 14:34
  • @SirRufo thanks, if you were implementing the above, would you choose Parallel over my Task solution, even with the increased time? FYI the real situation could be downloading over 10,000 images. – Percy May 16 '16 at 14:38
  • @Rick I would choose a pipeline for that. Let's say the download stage with 4 tasks and the storing stage with 1 task. – Sir Rufo May 16 '16 at 14:55
  • @SirRufo thanks, I'll read up on pipelines - never used them before. – Percy May 16 '16 at 14:57
  • 1
    @Rick You can start here [Walkthrough: Creating a Dataflow Pipeline](https://msdn.microsoft.com/library/hh228604%28v=vs.110%29.aspx) – Sir Rufo May 16 '16 at 15:03