
I have an HttpClient and am trying to download roughly 25k files of about 6 bytes each into RAM, in order to generate a SHA256 checksum from each of them.

I have tried to parallelize it to the best of my ability, but I couldn't see any change in speed at all.

The wrapping function that starts the download tasks in parallel:
SemaphoreSlim downloadConcurrencySemaphore = new SemaphoreSlim(40);
ConcurrentQueue<Task> Sha256Tasks = new ConcurrentQueue<Task>();
foreach (var uploadedFile in uploadedFiles)
{
    var t = Task.Run(async () =>
    {
        await downloadConcurrencySemaphore.WaitAsync();
        try
        {
            await uploadedFile.CalculateChecksum();
        }
        catch (Exception ex)
        {
            { } // breakpoint for debugging
        }
        finally
        {
            downloadConcurrencySemaphore.Release();
        }
    });
    Sha256Tasks.Enqueue(t);
}
Task.WaitAll(Sha256Tasks.ToArray());
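
Since the tasks are all started from a single loop, I believe a plain List<Task> plus await Task.WhenAll would be equivalent (the ConcurrentQueue<Task> is only needed when producers run concurrently) and would avoid blocking the calling thread the way Task.WaitAll does. A minimal sketch, assuming this runs inside an async method:

var downloadConcurrencySemaphore = new SemaphoreSlim(40);
var sha256Tasks = new List<Task>();

foreach (var uploadedFile in uploadedFiles)
{
    var t = Task.Run(async () =>
    {
        await downloadConcurrencySemaphore.WaitAsync(); // throttle to 40 concurrent downloads
        try
        {
            await uploadedFile.CalculateChecksum();
        }
        finally
        {
            downloadConcurrencySemaphore.Release();
        }
    });
    sha256Tasks.Add(t);
}

await Task.WhenAll(sha256Tasks); // asynchronous equivalent of Task.WaitAll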

CalculateChecksum, called by the wrapping function, downloads a file and generates the sha256sum from the byte array returned by the download:
public async Task CalculateChecksum()
{ // is being called up to 40 times in parallel (connection limit)
    byte[] file = await API.DownloadClient.Download(URL);
    Sha256Sum = Sha256.GetSha256Sum(file);
}
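
Sha256.GetSha256Sum is not shown here; it is just a thin wrapper around System.Security.Cryptography, roughly along these lines (a sketch only; the hex formatting is an assumption):

using System.Security.Cryptography;

internal static class Sha256
{
    // Hashes the downloaded bytes and returns the digest as a lowercase hex string.
    internal static string GetSha256Sum(byte[] data)
    {
        byte[] hash = SHA256.HashData(data); // one-shot hashing, available since .NET 5
        return Convert.ToHexString(hash).ToLowerInvariant();
    }
}
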
The DownloadClient class that is called in parallel to download the files:
internal static class DownloadClient
{
    static DownloadClient()
    {
        ServicePointManager.DefaultConnectionLimit = 40;

        var handler = new HttpClientHandler();
        handler.ClientCertificateOptions = ClientCertificateOption.Manual;
        handler.ServerCertificateCustomValidationCallback =
            (httpRequestMessage, cert, certChain, policyErrors) =>
            {
                return true;
            };
        handler.MaxConnectionsPerServer = 40;

        _Client = new HttpClient(handler);
    }
    private static HttpClient _Client;
    internal static async Task<byte[]> Download(string url)
    {
        HttpResponseMessage response = await _Client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        System.Net.Http.HttpContent content = response.Content;
        byte[] file = await content.ReadAsByteArrayAsync();
        return file;
    }
}
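
Note that on .NET 6 ServicePointManager.DefaultConnectionLimit has no effect on HttpClient; the handler's MaxConnectionsPerServer is what actually caps concurrent connections. An equivalent setup based on SocketsHttpHandler might look roughly like this (only a sketch; the limit of 40 mirrors the value above, and the comments below suggest experimenting with much higher values):

using System.Net.Http;
using System.Net.Security;
using System.Threading.Tasks;

internal static class DownloadClient
{
    private static readonly HttpClient _Client;

    static DownloadClient()
    {
        var handler = new SocketsHttpHandler
        {
            // On .NET 6 this is the setting that caps concurrent connections per host;
            // ServicePointManager.DefaultConnectionLimit only applies to HttpWebRequest.
            MaxConnectionsPerServer = 40,
            SslOptions = new SslClientAuthenticationOptions
            {
                // Accept any server certificate, same as the callback above.
                RemoteCertificateValidationCallback = (sender, cert, chain, errors) => true
            }
        };

        _Client = new HttpClient(handler);
    }

    internal static async Task<byte[]> Download(string url)
    {
        // GetByteArrayAsync throws HttpRequestException on non-success status codes.
        return await _Client.GetByteArrayAsync(url);
    }
}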

Any idea how to make the process faster? Right now it takes ~15 minutes for ~5.5 MB of files; the processor and network are barely used, and the program mostly sits idle, waiting for downloads.

  • No matter how much you try to parallelize, if the endpoint/backend you are requesting the data from can't keep up with your many concurrent requests or intentionally throttles when exposed to concurrent requests (from the same IP address or so), well, that's it then... –  Nov 25 '22 at 15:29
  • Could you test [this thing](https://stackoverflow.com/a/74554087/7444103) (.Net 6+), to see how it goes with different `MaxDegreeOfParallelism` settings and also changing the HttpClient's `DefaultRequestHeaders.ConnectionClose` value? `MaxConnectionsPerServer` is the ~equivalent of what you're setting now – Jimi Nov 25 '22 at 15:35
  • changed it from synchronous to async recently. – julian bechtold Nov 25 '22 at 15:35
  • Hmm, what is the value [ServicePointManager.DefaultConnectionLimit](https://learn.microsoft.com/en-us/dotnet/api/system.net.servicepointmanager.defaultconnectionlimit) and what if you increase it (right at the start of your program before it does any HttpClient-related stuff)? Does it improve things? –  Nov 25 '22 at 15:40
  • @Jimi connectionClose to true or false? – julian bechtold Nov 25 '22 at 15:47
  • Both. To see how the Server reacts. Try with `false` first (or comment out that property, same thing) -- Try to synchronize `MaxDegreeOfParallelism` and `MaxConnectionsPerServer` – Jimi Nov 25 '22 at 15:49
  • @MySkullCaveIsADarkPlace in theory this affects the client's internal connection limit. `ServicePointManager.DefaultConnectionLimit` should be global, whereas `handler.MaxConnectionsPerServer = 40;` should be client specific. Neither of which made much of a difference. Files should be located on decentralized servers... – julian bechtold Nov 25 '22 at 15:49
  • @TheodorZoulias changed that. no change. Target framework is .net 6 – julian bechtold Nov 25 '22 at 16:14
  • You could try replacing the parallelizing code with the simpler and more lightweight [`Parallel.ForEachAsync`](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreachasync) method (usage example [here](https://stackoverflow.com/questions/15136542/parallel-foreach-with-asynchronous-lambda/68901782#68901782)), but I don't expect significant improvement in the overall performance (a sketch of that approach is shown after these comments). – Theodor Zoulias Nov 25 '22 at 16:23
  • Does it make any difference if you run the code without the debugger attached? (with Ctrl+F5) – Theodor Zoulias Nov 25 '22 at 16:26
  • Also could you try running the code with different `MaxDegreeOfParallelism` configurations (like 10, 20, 50, 100), and report how this affects the overall performance? – Theodor Zoulias Nov 25 '22 at 16:30
  • hmm. After the other changes + MaxDegreeOfParallelism (max 1000 concurrent requests) the files are downloaded in a matter of seconds. Throws some HttpRequestException though. I'll try to tune it down to a more reasonable level. – julian bechtold Nov 25 '22 at 16:36
  • Did you test the class in the post I've linked? It handles exceptions and reports back (if needed) in case you need to know what files could be processed and the reason why that happened -- If you get stopping exceptions while debugging it's because you didn't configure the debugger. As mentioned, you can start without debugger and log the HTTP / other exceptions – Jimi Nov 25 '22 at 16:58
  • You can't expect to have your CPU utilized, because SHA256 takes no time compared to fetching a file from a remote web server. And a reasonable web server will not allow you to issue enough concurrent requests to utilize your CPU. – Evk Nov 25 '22 at 16:58
  • Don’t create new HttpClients for every request. This is extremely expensive as it needs to reserve a port from the OS. Instead, have one instance and reuse it. HttpClient is thread-safe. – ckuri Nov 25 '22 at 17:17
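
Following the Parallel.ForEachAsync suggestion above, a minimal sketch of the wrapping function rewritten that way (.NET 6; uploadedFiles and CalculateChecksum are the same as in the question, the MaxDegreeOfParallelism value is just a starting point for tuning, and the snippet assumes an async context):

var options = new ParallelOptions { MaxDegreeOfParallelism = 40 };

// Replaces the SemaphoreSlim + Task.Run fan-out; throttling is handled by ParallelOptions.
await Parallel.ForEachAsync(uploadedFiles, options, async (uploadedFile, cancellationToken) =>
{
    try
    {
        await uploadedFile.CalculateChecksum();
    }
    catch (HttpRequestException)
    {
        // Collect or log failed downloads here instead of swallowing them silently.
    }
});

Unlike the Task.WaitAll version this awaits asynchronously, and no semaphore is needed for the throttling.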

0 Answers