
I'm trying to download approximately 45,000 image files from an API. Each image file is less than 50 KB. With my current code this takes 2-3 hours.

Is there a more efficient way in C# to download them?

private static readonly string baseUrl =
    "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";
internal static void DownloadAllMissingPictures(List<ListObject> ImagesToDownload,
    string imageFolderPath)
{
    Parallel.ForEach(Partitioner.Create(0, ImagesToDownload.Count), range =>
    {
        for (var i = range.Item1; i < range.Item2; i++)
        {
            string ImageID = ImagesToDownload[i].ImageId;

            using (var webClient = new WebClient())
            {
                string url = String.Format(baseUrl, ImageID);
                string file = String.Format(@"{0}\{1}.jpg", imageFolderPath,
                    ImagesToDownload[i].ImageId);

                byte[] data = webClient.DownloadData(url);

                using (MemoryStream mem = new MemoryStream(data))
                {
                    using (var image = Image.FromStream(mem))
                    {
                        image.Save(file, ImageFormat.Jpeg);
                    }
                }                    
            }
        }
    });
}
Theodor Zoulias
Smutjes

3 Answers


I tested some variations of your suggestions. The code by Theodor Zoulias was my favourite.

It works fine and fast, with approximately 1,200 downloads per minute.

This is the final code I'm using now:

    private static readonly string _baseUrlPattern = "http://url.com/Handlers/Image.ashx?imageId={0}&type=card";

    private static readonly HttpClient _httpClient = new HttpClient();

    internal static void DownloadAllMissingPictures(CancellationToken cancellationToken = default)
    {
        ServicePointManager.DefaultConnectionLimit = 8;

        var parallelOptions = new ParallelOptions()
        {
            MaxDegreeOfParallelism = 10,
            CancellationToken = cancellationToken,
        };
        Parallel.ForEachAsync(ListWithImagesToDownload, parallelOptions, async (image, ct) =>
        {
            string imageId = image.identifiers.ImageId;
            string url = String.Format(_baseUrlPattern, imageId);
            string filePath = Path.Combine(imageFolderPath, $"{imageId}.jpg");

            using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
            response.EnsureSuccessStatusCode();

        // File.Create truncates an existing file; OpenWrite could leave stale bytes at the end
        using FileStream fileStream = File.Create(filePath);
        await response.Content.CopyToAsync(fileStream, ct);
        }).Wait();
    }

The code idea by TomTom is fine, but it stops after one loop, so I can't tell you which impact MaxConnectionsPerServer has on the download speed.

I'm sorry I can't share more experience with you. But as I said, I'm still a beginner with less than one year of programming experience.

Smutjes

The Parallel.ForEach method is not well suited for I/O-bound operations, because it requires a thread for each parallel workflow, and threads are not cheap resources. You can make it work by increasing the number of threads that the ThreadPool creates immediately on demand, with the SetMinThreads method, but that's not as efficient as asynchronous programming with async/await. With asynchronous programming a thread is not required while the file is downloaded, or while the file is saved to disk, so it is possible to download dozens of files concurrently using only a handful of threads.
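As a minimal sketch of that workaround (the value 50 and the class name are purely illustrative, not a recommendation):

```csharp
using System;
using System.Threading;

class ThreadPoolWarmup
{
    static void Main()
    {
        // The ThreadPool injects threads beyond its minimum only slowly
        // (roughly one per half second), which starves Parallel.ForEach when
        // every loop body blocks on a download. Raising the minimum lets the
        // pool create up to that many threads immediately on demand.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        Console.WriteLine($"default worker minimum: {workerMin}");

        // 50 is illustrative; each extra thread reserves ~1 MB of stack space.
        bool ok = ThreadPool.SetMinThreads(50, ioMin);
        Console.WriteLine(ok); // prints "True"
    }
}
```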

Using the Partitioner for creating ranges is a useful technique when parallelizing extremely granular (lightweight) workloads, like adding or comparing numbers. In your case the workload is quite coarse (chunky), so using ranges is more likely to slow things down than speed them up. Using ranges prevents balancing the workload, in case some files take longer to download than others.
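For contrast, here is a small sketch of the kind of fine-grained workload where range partitioning does pay off (all names here are illustrative): one delegate invocation per range amortizes the Parallel overhead across thousands of cheap additions, whereas a download is expensive enough to justify its own invocation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class PartitionerDemo
{
    // Sum an array in parallel using range partitioning. Each range is
    // processed by a tight sequential loop, and only the subtotal touches
    // shared state.
    public static long ParallelSum(int[] numbers)
    {
        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, numbers.Length), range =>
        {
            long subtotal = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                subtotal += numbers[i];
            Interlocked.Add(ref total, subtotal);
        });
        return total;
    }

    static void Main()
    {
        int[] data = new int[100_000];
        for (int i = 0; i < data.Length; i++) data[i] = 1;
        Console.WriteLine(ParallelSum(data)); // prints "100000"
    }
}
```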

My suggestion is to use the Parallel.ForEachAsync method (introduced in .NET 6), which is designed specifically for parallelizing asynchronous I/O operations. Here is how you can use this method in order to download the files in parallel, with a specific degree of parallelism, and cancellation support:

private static readonly string _baseUrlPattern =
    "http://url.com/Handlers/Image.ashx?imageid={0}&type=image";

private static readonly HttpClient _httpClient = new HttpClient();

internal static void DownloadAllMissingPictures(
    IEnumerable<ListObject> imagesToDownload, string imageFolderPath,
    CancellationToken cancellationToken = default)
{
    var parallelOptions = new ParallelOptions()
    {
        MaxDegreeOfParallelism = 10,
        CancellationToken = cancellationToken,
    };
    Parallel.ForEachAsync(imagesToDownload, parallelOptions, async (image, ct) =>
    {
        string imageId = image.ImageId;
        string url = String.Format(_baseUrlPattern, imageId);
        string filePath = Path.Combine(imageFolderPath, imageId + ".jpg");
        using HttpResponseMessage response = await _httpClient.GetAsync(url, ct);
        response.EnsureSuccessStatusCode();
        // File.Create truncates an existing file; OpenWrite could leave stale bytes at the end
        using FileStream fileStream = File.Create(filePath);
        await response.Content.CopyToAsync(fileStream, ct);
    }).Wait();
}

The Parallel.ForEachAsync method returns a Task. It's recommended that Tasks are awaited, but taking into account that you are probably not familiar with asynchronous programming yet, let's just Wait it instead for the time being.
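A tiny self-contained sketch of the difference (the names AwaitVsWait and ComputeAsync are illustrative, not part of the code above):

```csharp
using System;
using System.Threading.Tasks;

class AwaitVsWait
{
    public static async Task<int> ComputeAsync()
    {
        await Task.Delay(10); // stand-in for I/O work
        return 42;
    }

    static async Task Main()
    {
        // Preferred: await releases the thread while waiting and surfaces
        // exceptions directly.
        int viaAwait = await ComputeAsync();

        // Blocking alternatives: .Result (like .Wait()) blocks the calling
        // thread and wraps failures in an AggregateException.
        int viaResult = ComputeAsync().Result;

        Console.WriteLine(viaAwait + " " + viaResult); // prints "42 42"
    }
}
```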

In case the implementation above does not improve the performance of the whole procedure, you could experiment with the MaxDegreeOfParallelism configuration, and also with the settings mentioned in this question: How to increase the outgoing HTTP requests quota in .NET Core?

Theodor Zoulias
  • It does not matter how many parallel requests you do as long as you do not change the limitations of the http side - which is limited to 2 requests in parallel. Async is more efficient, but not to THAT degree - the OP runs straight into the http limitations as he never changes THOSE. Without those, you would overload the server, not wait for hours and get the results. As it is, the processing is limited to 2 parallel requests. – TomTom Nov 20 '21 at 21:41
  • @Smutjes: It would be very valuable for the broad public on this site if you could test the code suggested by Theodor Zoulias and publish the results here. Also, it would be very interesting and valuable if you could increase the number of concurrent connections (using `HttpClientHandler.MaxConnectionsPerServer` I guess) and publish those results as well. – Kalle Svensson Nov 21 '21 at 11:47
  • @KalleSvensson see my Answer above https://stackoverflow.com/a/70056592/16172845 – Smutjes Nov 21 '21 at 18:29
  • @Smutjes: Thank you very much for your effort and this feedback. I understand that you complete the task within ca 40min (45.000/1200) compared to 2-3 hours using the original code. It would be very interesting to know how `DefaultConnectionLimit` affects the result, e.g. to understand the penalty for using the default value. Another thing worth understanding is the difference between using `ServicePointManager` and `HttpClientHandler`, as the former should not be used for new development. – Kalle Svensson Nov 21 '21 at 19:58
  • @Smutjes: I know that I'm trying to load you with additional work, but you have a unique test environment and this knowledge will be very useful for yourself as well as for all the others. – Kalle Svensson Nov 21 '21 at 20:00
  • @KalleSvensson based on [this](https://github.com/dotnet/runtime/issues/1844 "Set sensible default value for HttpClientHandler.MaxConnectionsPerServer") GitHub issue, and if my understanding is correct, the `DefaultConnectionLimit` setting has no effect on .NET Core applications, and the `MaxConnectionsPerServer` can be used only to limit the number of connections per server, because by default it's unlimited. – Theodor Zoulias Nov 22 '21 at 11:49
  • @Theodor Zoulias: Smutjes uses .Net 6 so a new experience of using `ServicePointManager` or `HttpClientHandler` to change the number of concurrent connections would be invaluable. – Kalle Svensson Nov 22 '21 at 12:10
  • @KalleSvensson AFAIK the .NET 6 is just a newer version of the .NET Core, so I wouldn't expect much of a difference between the two platforms. 1.200 downloads per minute (or 20 downloads per second) is probably a hard limit imposed by either the network bandwidth, or the capabilities of the remote server, or the capabilities of the local storage device. It's unlikely to be a limit imposed by some settings in the platform. At least this is my theory. – Theodor Zoulias Nov 22 '21 at 12:32
  • @KalleSvensson I would test it if i could. But I have no idea of how to use a testing environment in Visual Studio. These are things i still have to learn. – Smutjes Nov 24 '21 at 10:26
  • @Smutjes: I only suggest to slightly change the code and rerun the test, nothing more. E.g. you could remove the statement `ServicePointManager.DefaultConnectionLimit = 8;` and verify if it has any impact on the execution time. Another trial could be to add the statement `HttpClientHandler.MaxConnectionsPerServer = 8;` and take the execution time again. You've got valuable help from the community, so you might do this in return, so to speak. – Kalle Svensson Nov 25 '21 at 14:25
  • @KalleSvensson I'll do some tests next week and share the results with you. – Smutjes Nov 29 '21 at 10:57
  • @Smutjes Thank you, I'm waiting with great interest. I guess that quite a few other readers do it as well. – Kalle Svensson Dec 02 '21 at 10:34

Likely not.

One thing to think about, though - stop using WebClient, as it was replaced by HttpClient a long time ago; you just missed the memo. I suggest a quick run through the documentation.

Regardless of what you think you achieve with Parallel.ForEach - you are limited by the parallel connection settings (ServicePointManager, HttpClientHandler).

You should read the manuals for those and experiment with higher limits, because right now they quite likely cap your parallelism at a low number, and the server can possibly handle 3-4 times that limit.

Maximum concurrent requests for WebClient, HttpWebRequest, and HttpClient

has a deeper explanation.
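A minimal sketch of configuring that limit via HttpClientHandler (the value 10 is illustrative, something to experiment with):

```csharp
using System;
using System.Net.Http;

class ConnectionLimitDemo
{
    static void Main()
    {
        // Configure the per-server connection cap before the first request.
        // On .NET Core / .NET 5+ the legacy ServicePointManager settings have
        // no effect; HttpClientHandler.MaxConnectionsPerServer is the knob.
        var handler = new HttpClientHandler
        {
            MaxConnectionsPerServer = 10 // illustrative; measure and adjust
        };
        var httpClient = new HttpClient(handler);

        Console.WriteLine(handler.MaxConnectionsPerServer); // prints "10"
    }
}
```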

Suraj Rao
TomTom
  • Thank you. I 'missed' the memo because I'm still a programming beginner. I'll change WebClient to HttpClient and do some experiments with it. – Smutjes Nov 20 '21 at 19:53
  • Well, you likely missed the memo because you followed some older tutorial somewhere - they started changing it some time ago. – TomTom Nov 20 '21 at 19:55
  • Yes, I think I followed an older one. Therefore I'm thankful for your answer. – Smutjes Nov 20 '21 at 20:00