0

I have a large number of images I'm try to download (and then resize and save). I'm trying to do this in the most efficient way possible. I have opted for to use a BlockingCollection and with a producer which downloads the images and a consumer that will resize and save the images once they've been downloaded.

The main issue I'm having is with the producer which downloads the images. I'm using a SemaphoreSlim to queue up the download tasks and from the output I can tell it's working more quickly and nice and asynchronously. I am adding to a list of tasks and then using Task.WaitAll to wait for all the downloads to finish before finalising the whole process. the problem is that WaitAll doesn't seem to actually wait for all the tasks to finish before continuing as you can see from this screenshot:

enter image description here

Here's my dummy code which will reproduce the issue:

class Program
{
    static IDictionary<string, string> imageUrls;
    static BlockingCollection<Image> queue;

    static void Main(string[] args)
    {
        imageUrls = new Dictionary<string, string>();
        for (int i = 0; i < 20; i++)
        {
            imageUrls.Add($"{i}-1", "https://emergingpayments.org/wp-content/uploads/2017/11/landscape-2.jpeg");
            imageUrls.Add($"{i}-2", "https://static.photocdn.pt/images/articles/2018/03/09/articles/2017_8/landscape_photography.jpg");
            imageUrls.Add($"{i}-3", "https://wallup.net/wp-content/uploads/2015/12/258088-sunset-landscape-horizon.jpg");
        }

        queue = new BlockingCollection<Image>();

        Task.WaitAll(
            Task.Run(() => processImages()),
            Task.Run(() => downloadImages())
        );
    }

    static async Task downloadImages()
    {
        using (var client = new HttpClient(new HttpClientHandler { MaxConnectionsPerServer = 100 }))
        {
            var stopwatch = new Stopwatch();
            stopwatch.Start();

            using (var semaphore = new SemaphoreSlim(20))
            {
                var downloadTasks = new List<Task>();
                foreach (var imageUrl in imageUrls)
                {
                    await semaphore.WaitAsync();
                    downloadTasks.Add(Task.Factory.StartNew(async () =>
                    {
                        try
                        {
                            var imageData = await client.GetByteArrayAsync(imageUrl.Value);
                            queue.Add(new Image { Id = imageUrl.Key, ImageData = imageData });
                            Console.WriteLine($"Downloaded image {imageUrl.Key}");
                        }
                        catch (Exception ex)
                        {
                            Console.WriteLine($"Download failed for image {imageUrl.Key} - {ex.Message}");
                        }
                        finally
                        {
                            semaphore.Release();
                        }
                    }));
                }

                Task.WaitAll(downloadTasks.ToArray());
                Console.WriteLine($"Downloading took {stopwatch.Elapsed.TotalSeconds} seconds");
            }
        }            
    }

    static void processImages()
    {
        while (!queue.IsCompleted)
        {
            Image image = null;
            try
            {
                image = queue.Take();
            }
            catch (InvalidOperationException) { }

            if (image != null)
            {
                Thread.Sleep(100);
                Console.WriteLine($"Processed image {image.Id}");
                Console.WriteLine($"Processed image {image.Id}");
                Console.WriteLine($"Processed image {image.Id}");
            }
        }
    }

    class Image
    {
        public string Id { get; set; }
        public byte[] ImageData { get; set; }
    }
}

My solution is similar to this: https://stackoverflow.com/a/36571420/5392786

I think it may be a problem with where I'm releasing the semaphore or something but I can't seem to figure out what the problem is.

Andy Furniss
  • 3,814
  • 6
  • 31
  • 56
  • 1
    Use Task.Run instead of Task.Factory.StartNew (in this case it's the source of your problem and not just "good practice"). Also worth knowing that you should use `await Task.WhenAll` instead of `Task.WaitAll` where appropriate (in async methods for example). – Evk Apr 24 '18 at 21:40
  • 1
    Probably not what you want to hear when you're already committed to this path... but you'd do well to look into [TPL DataFlow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) to coordinate all of this. Stephen Cleary has a nice writeup: https://blog.stephencleary.com/2012/09/introduction-to-dataflow-part-1.html . In the meantime, do you understand the difference between `Task.Run` and `Task.Factory.StartNew`? You should do. I'd also consider going async **all** the way, so you don't need to rely on blocking to get the job done. – spender Apr 24 '18 at 21:40
  • you started process and download at the same time. and the waitall inside download only applies to download. I see no problem with the output – Steve Apr 24 '18 at 21:41
  • 1
    ...If you don't want to go all the way with TPL DataFlow, at least take a look at [`BufferBlock`](https://stackoverflow.com/a/7868264/14357)... it provides a great alternative to BlockingCollection. It doesn't block and allows for `await`ing the next item. Invaluable in an async world. – spender Apr 24 '18 at 21:47
  • Thanks a lot guys. In this scenario `Task.Run` was the simple solution. I was originally using this but having scoured the internet for solutions my code ended up as an amalgamation of my attempts to solve the issue which sadly left me with some poor code in this case! @spender - I am open to better solutions so I will have a look at both TPL Dataflow and the BufferBlock - thanks a bunch for your advice! – Andy Furniss Apr 24 '18 at 21:49
  • @spender TPL Dataflow is beautiful. You are my hero. – Andy Furniss Apr 24 '18 at 22:32
  • 2
    @AndyFurniss :) I'm glad you took a peek... I drop the "try DataFlow" comment at least once a month, but most people don't seem that interested. It seemed a good fit for the processing pipeline you're trying to make. IMO, it's a poorly advertised gem hiding in plain sight. – spender Apr 24 '18 at 22:52

0 Answers0