2

I am writing a WPF application that processes an image data stream from an IR camera. The application uses a class library for processing steps such as rescaling or colorizing, which I am also writing myself. An image processing step looks something like this:

byte[,] ProcessFrame(double[,] frame)
{
  int width = frame.GetLength(1);
  int height = frame.GetLength(0);
  byte[,] result = new byte[height, width];
  Parallel.For(0, height, row =>
  {
    for (var col = 0; col < width; ++col)
      result[row, col] = ManipulatePixel(frame[row, col]);
  });
  return result;
}

Frames are processed by a task that runs in the background. The issue is that, depending on how costly the specific processing algorithm (ManipulatePixel()) is, the application can't keep up with the camera's frame rate. However, I have noticed that despite the parallel for loops, the application simply won't use all of the CPU that is available - the Task Manager performance tab shows only about 60-80% CPU usage.

I have used the same processing algorithms in C++ before, using the concurrency::parallel_for loops from the Parallel Patterns Library. The C++ code uses all of the CPU it can get, as I would expect. I also tried P/Invoking a C++ DLL from my C# code, running the same algorithm that is slow in the C# library: it likewise uses all the available CPU power, usage sits at virtually 100% the whole time, and there is no trouble at all keeping up with the camera.

Outsourcing the code into a C++ DLL and then marshalling the results back into C# is extra hassle I'd of course rather avoid. How do I make my C# code actually use all of the available CPU? I have tried increasing the process priority like this:

  using (Process process = Process.GetCurrentProcess())
    process.PriorityClass = ProcessPriorityClass.RealTime;

This has an effect, but only a very small one. I also tried setting the degree of parallelism for the Parallel.For() loops like this:

ParallelOptions parallelOptions = new ParallelOptions();
parallelOptions.MaxDegreeOfParallelism = Environment.ProcessorCount;
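
For completeness, the options were then passed to the overload of Parallel.For() that takes a ParallelOptions argument, roughly like this (width, height, result and ManipulatePixel are the same placeholders as in the code at the top):

Parallel.For(0, height, parallelOptions, row =>
{
  for (var col = 0; col < width; ++col)
    result[row, col] = ManipulatePixel(frame[row, col]);
});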

Passing that to the Parallel.For() loop had no effect at all, but I suppose that's not surprising, since the default settings should already be optimized. I also tried setting this in the application configuration:

<runtime>
  <Thread_UseAllCpuGroups enabled="true"></Thread_UseAllCpuGroups>
  <GCCpuGroup enabled="true"></GCCpuGroup>
  <gcServer enabled="true"></gcServer>
</runtime>

but this actually makes it run even slower.


EDIT: The ProcessFrame code block I quoted originally was actually not quite correct. What I was doing at the time was:

byte[,] ProcessFrame(double[,] frame)
{
  byte[,] result = new byte[frame.GetLength(0), frame.GetLength(1)];
  Parallel.For(0, frame.GetLength(0), row =>
  {
    for (var col = 0; col < frame.GetLength(1); ++col)
      result[row, col] = ManipulatePixel(frame[row, col]);
  });
  return result;
}

Sorry about this - I was paraphrasing the code at the time and didn't realize that this is an actual pitfall that produces different results. I have since changed the code to what I originally posted (i.e. the width and height variables are set at the beginning of the function, and the array's length properties are queried only once each instead of in the for loops' conditions). Thank you @Seabizkit, your second comment inspired me to try this. The change in fact already makes the code run noticeably faster - I hadn't realized this because C++ doesn't know 2D arrays, so I had to pass the pixel dimensions as separate arguments anyway. Whether it is fast enough as it is, I cannot say yet.

Also, thank you for the other answers; they contain a lot of things I don't know yet, and it's great to know what I have to look for. I'll update once I've reached a satisfactory result.

flibbo
  • 125
  • 10
  • 2
    just to throw a spanner in the works: 100% doesn't mean you're doing it 100% efficiently, just that whatever you asked it to do is making it work hard, not necessarily faster - hope you understand what I mean. – Seabizkit Sep 20 '19 at 10:32
  • further, in your code there would be a lot of threads all interacting with frame, which would slow things down. – Seabizkit Sep 20 '19 at 10:34
  • Of course CPU usage is not the same as code efficiency, but that is not the issue here. I am comparing two ways of doing the exact same algorithm: one isn't using all the processing power available while the other one is. You'll have no chance of being competitive with different ways of doing the same task if you don't even use the whole processing power that's available to you. Why is the C++ code running so much faster than the C#? As I mentioned, in C++ I have no trouble keeping up with the camera frame rate, the steps done are the same, and the C++ parallel for loops also all access the frame. – flibbo Sep 20 '19 at 11:03
  • Set gcServer enabled="true" in your .config; the garbage collector could slow the application down under multithreaded usage. – LeBigCat Sep 20 '19 at 11:10
  • I can't replicate the issue. I added a dummy loop inside `ManipulatePixel`, and the CPU utilization goes to 99% and stays there till the end of the process. Configuration: Windows 10, AMD Athlon, .NET Framework 4.7.2, Release build, Any CPU, Prefer 32-bit. – Theodor Zoulias Sep 20 '19 at 12:24
  • What is ManipulatePixel doing, and what is it doing with frame? frame is shared across threads - can this be avoided? The point I was trying to make about 100% CPU is the last point of Dai's answer. – Seabizkit Sep 20 '19 at 12:55

3 Answers

4

I would need to have all of your code and be able to run it locally in order to diagnose the problem, because your posting is short on details (I would need to see inside your ManipulatePixel function, as well as the code that calls ProcessFrame). But here are some general tips that apply in your case:

  • 2D arrays in .NET are significantly slower than 1D and jagged arrays, even in .NET Core today - this is a longstanding bug (see the flat-array sketch after this list).

  • Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.

  • Avoid allocating a new buffer for each frame - if a frame has a limited lifespan then consider reusing buffers from a buffer pool (see the ArrayPool sketch after this list).
  • Consider using the SIMD and AVX features in .NET. While modern C/C++ compilers are smart enough to compile code to use those instructions, the .NET JIT isn't so hot - but you can make explicit calls into SIMD/AVX instructions using the SIMD-enabled types (you'll need .NET Core 2.0 or later for the best accelerated functionality; see the Vector<T> sketch after this list).

  • Also, avoid copying individual bytes or scalar values inside a for loop in C#; instead consider using Buffer.BlockCopy for bulk copy operations, as these can use hardware memory-copy features (see the sketch after this list).

  • Regarding your observation of "80% CPU usage" - if you have a loop in a program then that will cause 100% CPU usage within the time-slices provided by the operating-system. If you don't see 100% usage, then either:

    • Your code is actually running faster than real-time (this is a good thing!) - unless you're certain your program can't keep up with the input?
    • Your code's thread (or threads) is blocked by something, such as a blocking IO call or a misplaced Thread.Sleep. Use tools like ETW to see what your process is doing when you think it should be CPU-bound.
    • Ensure you aren't using any lock (Monitor) calls or other thread or memory synchronization primitives.
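
To illustrate the first point, here's a minimal sketch of the same kind of loop over a flat 1D buffer with manual index math (ManipulatePixel is the placeholder from your question, stubbed out here just so the snippet compiles; Parallel.For needs System.Threading.Tasks):

// Flat 1D buffer instead of double[,]: one offset computation per row, and the
// JIT can eliminate more bounds checks than it can for multi-dimensional arrays.
static byte ManipulatePixel(double value) => (byte)value;   // placeholder stub

static byte[] ProcessFrameFlat(double[] frame, int width, int height)
{
  var result = new byte[width * height];
  Parallel.For(0, height, row =>
  {
    int offset = row * width;                    // start index of this row
    for (int col = 0; col < width; ++col)
      result[offset + col] = ManipulatePixel(frame[offset + col]);
  });
  return result;
}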
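
For the buffer-reuse point, a sketch using ArrayPool<T> from the System.Buffers package; width and height are assumed to be your frame dimensions, and note that a rented array may be larger than the requested size:

// Rent a reusable buffer instead of allocating a fresh one per frame.
byte[] output = ArrayPool<byte>.Shared.Rent(width * height);
try
{
  // ... fill 'output' with the processed frame and hand it to the next stage ...
}
finally
{
  ArrayPool<byte>.Shared.Return(output);   // return it to the pool for the next frame
}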
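
For the SIMD point, a sketch using the SIMD-enabled Vector<T> type from System.Numerics (the System.Numerics.Vectors package on .NET Framework); it scales a float buffer in place, one vector-width of elements per iteration:

static void ScaleInPlace(float[] buffer, float factor)
{
  int i = 0;
  int lanes = Vector<float>.Count;            // e.g. 8 floats with AVX
  for (; i <= buffer.Length - lanes; i += lanes)
  {
    var v = new Vector<float>(buffer, i);     // load a block of pixels
    (v * factor).CopyTo(buffer, i);           // multiply and store it back
  }
  for (; i < buffer.Length; ++i)              // scalar tail for the remainder
    buffer[i] *= factor;
}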
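
And for Buffer.BlockCopy, a sketch that copies one whole row of a ushort frame in a single call; note that the offset and count arguments are in bytes, not elements:

static ushort[] CopyRow(ushort[,] frame, int row)
{
  int width = frame.GetLength(1);
  var rowCopy = new ushort[width];
  // Offsets and count are byte counts, hence the sizeof(ushort) factors.
  Buffer.BlockCopy(frame, row * width * sizeof(ushort), rowCopy, 0, width * sizeof(ushort));
  return rowCopy;
}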
Dai
  • 141,631
  • 28
  • 261
  • 374
  • ❝Sharing memory buffers between threads makes it harder for the system to optimize safe memory accesses.❞ Could you elaborate on this? Assuming that there is zero thread-synchronization code, do you imply that the system still synchronizes the memory access somehow? – Theodor Zoulias Sep 20 '19 at 12:42
  • So the first thing I tried was replacing all the 2D arrays with 1D arrays - however I saw no difference in performance as a result whatsoever... I'll look into your other suggestions next week – flibbo Sep 20 '19 at 16:08
  • 1
    @flibbo just using another "indexing" for the same amount of ill-organised, per-1-pixel-only memory accesses, distributed across more RAM-non-local (NUMA) CPU cores (with all the add-on costs of "sharing", "sync"-ing and locking, plus GC non-determinism), will never be the game-changer move, as the root-cause inefficiencies were all left in place. If ultimate performance needs to be your younger brother, start from understanding how the complex of RAM, memory I/O, CPU and L1/2/3 caches actually works, and learn where you can live with latency/throughput and what to avoid – user3666197 Sep 20 '19 at 17:29
  • @user3666197 I tried it because Dai suggested 2D arrays are slower than 1D arrays in C# because of a bug. Anyway, do you happen to have some sources on what you mentioned that I could read up on? – flibbo Sep 20 '19 at 19:29
2

Efficiency matters (this is not a true-[PARALLEL] problem, but it may, yet need not, benefit from "just"-[CONCURRENT] work).

The BEST, yet rather hard, way - if ultimate performance is a MUST:

In-line assembly, optimised for the cache-line sizes in the CPU hierarchy, and keep the indexing aligned with the actual memory layout of the 2D data { column-wise | row-wise }. Given that no 2D kernel transformation is mentioned, your process does not need to "touch" any topological neighbours, so the indexing can step in whatever order "across" both ranges of the 2D domain, and ManipulatePixel() may get more efficient by transforming blocks of pixels instead of bearing all the overheads of a call for each isolated, atomicised 1px (ILP + cache-efficiency are on your side).

Given your target production-platform CPU family, best use the (block-SIMD) vectorised instructions available from AVX2, or better AVX512. As you most probably know, you may use C/C++ with AVX intrinsics for the performance optimisations, inspect the resulting assembly, and finally "copy" the best of it into your C# via assembly-inlining. Nothing will run faster. Tricks with CPU-core affinity mapping and eviction/reservation are indeed a last resort, yet they may help in almost hard-real-time production settings (though hard R/T systems are seldom developed in an ecosystem with non-deterministic behaviour).

A CHEAP, few-seconds step:

Test and benchmark the per-batch-of-frames run-time of a reversed composition: move the more "expensive" part, the Parallel.For(...){...}, inside the for(var col = 0; col < width; ++col){...} loop, to see how the cost of instantiating the Parallel.For() instrumentation changes.

Next, if going this cheap way, think about re-factoring ManipulatePixel() to work on at least a block of data, aligned with the data-storage layout and sized as a multiple of the cache-line length (with cache hits, memory accesses cost ~0.5-5 [ns], against ~100-380 [ns] otherwise). Here, a will to distribute the work (the worse if per 1px) across all NUMA CPU cores results in paying far more time, due to extended access latencies for cross-NUMA (non-local) memory addresses. Besides never re-using an expensively cached block of fetched data, you knowingly pay excessive costs for cross-NUMA (non-local) memory fetches, from which you "use" just 1px and "throw away" all the rest of the cached block (those pixels will get re-fetched and manipulated on some other CPU core at some other time ~ a triple waste of time ~ sorry to mention it so explicitly, but when shaving off every possible [ns] this cannot happen in a production pipeline). A rough sketch of the block-wise idea follows.
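
A rough sketch of that block-wise idea (manipulateBlock stands in for a re-factored ManipulatePixel() that processes the contiguous rows [first, last); rowsPerBlock is an illustrative tuning knob for keeping each task on contiguous, cache-friendly memory):

static void ProcessInRowBlocks(ushort[,] src, byte[,] dst, int rowsPerBlock,
                               Action<ushort[,], byte[,], int, int> manipulateBlock)
{
  int height = src.GetLength(0);
  int blocks = (height + rowsPerBlock - 1) / rowsPerBlock;
  Parallel.For(0, blocks, b =>
  {
    int first = b * rowsPerBlock;
    int last = Math.Min(first + rowsPerBlock, height);   // exclusive upper bound
    manipulateBlock(src, dst, first, last);              // one contiguous band of rows per task
  });
}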


Anyway, let me wish you perseverance and good luck on your steps forward to gain the needed efficiency back on your side.

user3666197
  • 1
  • 6
  • 50
  • 92
0

Here's what I ended up doing, mostly based on Dai's answer:

  • made sure to query the image pixel dimensions once at the beginning of the processing functions, not within the for loops' conditions. With parallel loops, it would seem this creates contended access to those properties from multiple threads, which noticeably slows things down.
  • removed the allocation of output buffers within the processing functions. They now return void and accept the output buffer as an argument. The caller creates only one buffer for each image processing step (filtering, scaling, colorizing), which doesn't change in size but gets overwritten with each frame (see the sketch after this list).
  • removed an extra data processing step where raw image data in the format ushort (what the camera originally spits out) was converted to double (actual temperature values). Instead, processing is applied to the raw data directly. Conversion to actual temperatures will be dealt with later, as necessary.
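
Roughly, the calling side now looks like this (a simplified sketch - acquireFrame and display stand in for the camera read and the UI hand-off, which I'm not showing here; Rescale is the method shown further down):

static void RunPipeline(int width, int height, (ushort, ushort) limits,
                        Action<ushort[,]> acquireFrame, Action<byte[,]> display,
                        CancellationToken token)
{
  var raw = new ushort[height, width];      // reused input buffer
  var scaled = new byte[height, width];     // reused output buffer, overwritten each frame

  while (!token.IsCancellationRequested)
  {
    acquireFrame(raw);               // fill 'raw' with the next camera frame
    Rescale(raw, scaled, limits);    // overwrites 'scaled' in place, no allocation
    display(scaled);                 // hand the reused buffer to the consumer
  }
}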

I also tried, without success, to use 1D arrays instead of 2D but there is actually no difference in performance. I don't know if it's because the bug Dai mentioned was fixed in the meantime, but I couldn't confirm 2D arrays to be any slower than 1D arrays.

Probably also worth mentioning, the ManipulatePixel() function in my original post was actually more of a placeholder rather than a real call to another function. Here's a more proper example of what I am doing to a frame, including the changes I made:

private static void Rescale(ushort[,] originalImg, byte[,] scaledImg, in (ushort, ushort) limits)
{
  Debug.Assert(originalImg != null);
  Debug.Assert(originalImg.Length != 0);
  Debug.Assert(scaledImg != null);
  Debug.Assert(scaledImg.Length == originalImg.Length);

  ushort min = limits.Item1;
  ushort max = limits.Item2;
  int width = originalImg.GetLength(1);
  int height = originalImg.GetLength(0);

  Parallel.For(0, height, row =>
  {
    for (var col = 0; col < width; ++col)
    {
      ushort value = originalImg[row, col];
      if (value < min)
        scaledImg[row, col] = 0;
      else if (value > max)
        scaledImg[row, col] = 255;
      else
        scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
    }
  });
}

This is just one step and some others are much more complex but the approach would be similar.

Some of the things mentioned, like SIMD/AVX or user3666197's answer, are unfortunately well beyond my abilities right now, so I couldn't test them.

It's still relatively easy to put enough processing load into the stream to tank the frame rate, but for my application the performance should be enough now. Thanks to everyone who provided input; I'll mark Dai's answer as accepted because I found it the most helpful.

flibbo
  • 125
  • 10
  • 1
    You may see a slight improvement if you use a [`Partitioner`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.partitioner) to reduce the granularity of the parallelization: `Parallel.ForEach(Partitioner.Create(0, height), range => { for (int row = range.Item1; row < range.Item2; row++) ...` – Theodor Zoulias Sep 23 '19 at 15:32
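
Applied to the Rescale method above, that suggestion would look roughly like this (a sketch; Partitioner lives in System.Collections.Concurrent, and the variables are the ones from Rescale):

Parallel.ForEach(Partitioner.Create(0, height), range =>
{
  // Each task gets a contiguous range of rows, reducing per-iteration delegate overhead.
  for (int row = range.Item1; row < range.Item2; row++)
  {
    for (var col = 0; col < width; ++col)
    {
      ushort value = originalImg[row, col];
      if (value < min)
        scaledImg[row, col] = 0;
      else if (value > max)
        scaledImg[row, col] = 255;
      else
        scaledImg[row, col] = (byte)(255.0 * (value - min) / (max - min));
    }
  }
});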