I've run this on the console and made a few changes to compare different configurations.
- Without any exceptions, multiple runs for a list of 2000 tasks took between 2179 and 2189 ms.
- With 5% chance of an exception, they took ~4s
- With 50% chance of an exception, they took 22-23s.
- With 75% chance of an exception, they took 31s-33s.
- With an exception thrown each time, they took 42s
That's close to a linear relation - each exception adds around 20ms.
My first guess for the 20ms would be that Thread.Sleep(1) actually is not as precise as you'd hope. The documentation mentions the system clock resolution, which in the documentation for Task.Delay is described in a sidenote as 15ms. A stopwatch seemed to confirm that sleeping actually took 20ms. But with actual CPU-bound work that only took 2ms, instead of 40s it now took 60s when each task had to handle exceptions.
And when I increase the sleep duration to 1000ms, with 5% exceptions the total duration is ~12.2s. So it's not simply the sleep time adding up sequentially.
So I've used some debug outputs to see what's happening when running 10 tasks, all throwing exceptions that lead to 1000ms sleep. I've added outputs comparing task-local and global stopwatch times for: start, throwing, catching, after resuming from sleep.
Starting task 1 at: 0ms (task-local) 15ms (global)
Starting task 2 at: 0ms (task-local) 40ms (global)
Starting task 3 at: 0ms (task-local) 41ms (global)
Starting task 4 at: 0ms (task-local) 43ms (global)
Starting task 5 at: 0ms (task-local) 45ms (global)
Starting task 6 at: 0ms (task-local) 47ms (global)
Starting task 7 at: 0ms (task-local) 50ms (global)
Starting task 8 at: 0ms (task-local) 51ms (global)
Starting task 9 at: 0ms (task-local) 53ms (global)
Starting task 10 at: 0ms (task-local) 55ms (global)
Started Thread 2948
Started Thread 15380
Started Thread 17984
Started Thread 11564
Throwing exception in task 7 at: 2016ms (task-local) 2066ms (global)
Throwing exception in task 8 at: 2014ms (task-local) 2066ms (global)
Throwing exception in task 6 at: 2018ms (task-local) 2066ms (global)
Throwing exception in task 5 at: 2020ms (task-local) 2066ms (global)
Caught exception in task 7 at: 2084ms (task-local) 2134ms (global)
Caught exception in task 6 at: 2098ms (task-local) 2145ms (global)
Caught exception in task 5 at: 2102ms (task-local) 2147ms (global)
Caught exception in task 8 at: 2109ms (task-local) 2161ms (global)
Started Thread 15468
Throwing exception in task 9 at: 3015ms (task-local) 3068ms (global)
Caught exception in task 9 at: 3035ms (task-local) 3088ms (global)
Slept for 1s in task 7 at: 3107ms (task-local) 3157ms (global)
Throwing exception in task 4 at: 3118ms (task-local) 3162ms (global)
Slept for 1s in task 5 at: 3118ms (task-local) 3164ms (global)
Slept for 1s in task 6 at: 3115ms (task-local) 3162ms (global)
Slept for 1s in task 8 at: 3122ms (task-local) 3173ms (global)
Throwing exception in task 2 at: 3145ms (task-local) 3185ms (global)
Throwing exception in task 1 at: 3173ms (task-local) 3188ms (global)
Caught exception in task 4 at: 3143ms (task-local) 3187ms (global)
Caught exception in task 2 at: 3179ms (task-local) 3219ms (global)
Caught exception in task 1 at: 3215ms (task-local) 3230ms (global)
Throwing exception in task 3 at: 3142ms (task-local) 3184ms (global)
Caught exception in task 3 at: 3213ms (task-local) 3255ms (global)
Started Thread 27088
Throwing exception in task 10 at: 4010ms (task-local) 4065ms (global)
Caught exception in task 10 at: 4031ms (task-local) 4086ms (global)
Slept for 1s in task 9 at: 4038ms (task-local) 4092ms (global)
Slept for 1s in task 2 at: 4193ms (task-local) 4233ms (global)
Slept for 1s in task 4 at: 4190ms (task-local) 4233ms (global)
Slept for 1s in task 1 at: 4221ms (task-local) 4236ms (global)
Slept for 1s in task 3 at: 4222ms (task-local) 4264ms (global)
Slept for 1s in task 10 at: 5035ms (task-local) 5091ms (global)
Compare this to the output without exceptions:
Starting task 1 at: 0ms (task-local) 2ms (global)
Starting task 2 at: 0ms (task-local) 36ms (global)
Starting task 3 at: 0ms (task-local) 38ms (global)
Starting task 4 at: 0ms (task-local) 40ms (global)
Starting task 5 at: 0ms (task-local) 42ms (global)
Starting task 6 at: 0ms (task-local) 46ms (global)
Starting task 7 at: 0ms (task-local) 48ms (global)
Starting task 8 at: 0ms (task-local) 50ms (global)
Starting task 9 at: 0ms (task-local) 53ms (global)
Starting task 10 at: 0ms (task-local) 55ms (global)
Started Thread 12496
Started Thread 16480
Started Thread 10184
Started Thread 17696
Done waiting in task 2 at: 2023ms (task-local) 2059ms (global)
Done waiting in task 5 at: 2016ms (task-local) 2059ms (global)
Done waiting in task 4 at: 2018ms (task-local) 2059ms (global)
Done waiting in task 3 at: 2020ms (task-local) 2059ms (global)
Done waiting in task 1 at: 2066ms (task-local) 2068ms (global)
Done waiting in task 6 at: 2024ms (task-local) 2070ms (global)
Done waiting in task 7 at: 2023ms (task-local) 2071ms (global)
Done waiting in task 10 at: 2017ms (task-local) 2072ms (global)
Done waiting in task 9 at: 2018ms (task-local) 2071ms (global)
Done waiting in task 8 at: 2020ms (task-local) 2071ms (global)
I'm seeing a block of four tasks being handled together in the beginning. When they throw, they're throwing, being caught and handled together, then together they block the main thread before the next block of four can proceed, but it's starting to overlap a bit. For a huge number of tasks and exceptions, the exceptions trickle in sequentially in blocks - maybe depending on the number of available cores?
Writing console output can of course mess up the measurements, but I've seen that after roughly 800 tasks, 2000ms have passed, and the first task is done waiting. So while not all tasks have even started, the first exception starts slowing down the main thread.
Seeing this, I think it depends on your actual code when it comes to improving it. Can you change the API to allow sending multiple rows with one request?
Maybe don't start sending all 6000 requests at once and individually - if one fails, the rest are likely to also fail. Maybe a single catch block around the await Task.WhenAll
is enough?
Update:
I did what I should have done in the beginning and measured without flooding the console. Now it's more obvious that catching each exception takes a significant amount of time (>20 ms). And that's happening on one thread, so one exception at a time. When the exception handling really takes only around 1ms, then switching to await Task.WhenAll
just saves you as many milliseconds as you have exceptions.