
This will be a long question, sorry in advance. I don't expect a full code solution; I am looking for input from people with a different perspective and more experience than me.

My company is developing software for a product that does some rather expensive calculations using a film from an IR camera where every pixel contains a temperature value. The most costly of those calculations is called Thermal Signal Reconstruction (if you are interested, you can read about it here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321698/ ). It basically performs a polynomial fit for each pixel over time (i.e. across the frames of the film). My C# implementation looks something like this:

public static double[,,] ThermalSignalReconstruction(List<Frame<double>> thermalFilm, byte polyOrder)
{
  Resolution filmResolution = thermalFilm[0].Resolution;
  uint width = filmResolution.Width;
  uint height = filmResolution.Height;
  int frames = thermalFilm.Count;
  double[,,] result = new double[polyOrder + 1, height, width];

  // using frame indexes as x-values for poly fit
  List<double> frameIndexes = new List<double>(frames);
  for (var frame = 0U; frame < frames; ++frame)
    frameIndexes.Add(frame);

  // make local copy of thermal film and fill with difference images
  List<Frame<double>> localThermalFilm = new List<Frame<double>>(frames);
  for (var frame = 0U; frame < frames; ++frame)
    localThermalFilm.Add(new Frame<double>(filmResolution));
  Parallel.For(0U, frames, frame =>
  {
    for (var row = 0U; row < height; ++row)
      for (var col = 0U; col < width; ++col)
        localThermalFilm[(int)frame].Data[row, col] = thermalFilm[(int)frame].Data[row, col] - thermalFilm[0].Data[row, col];
  });

  // determine flashpoint by finding the frame with the maximum average pixel value
  // (compute the averages in parallel, then take the argmax sequentially --
  // updating a shared maximum from inside Parallel.For would be a data race)
  double[] averages = new double[frames];
  Parallel.For(0U, frames, frame =>
  {
    averages[frame] = Math.MatrixMean(localThermalFilm[(int)frame].Data);
  });
  double maxAverage = double.MinValue;
  uint maxIndex = 0U;
  for (var frame = 0; frame < frames; ++frame)
    if (averages[frame] > maxAverage)
    {
      maxAverage = averages[frame];
      maxIndex = (uint)frame;
    }
  // remove frames preceding the flashpoint, including the flashpoint frame itself, from the film
  localThermalFilm.RemoveRange(0, (int)maxIndex + 1);
  frameIndexes.RemoveRange(0, (int)maxIndex + 1);
  frames -= (int)maxIndex + 1;

  // calculate base 10 logarithm of all pixels and frame indexes
  Parallel.For(0U, frames, frame =>
  {
    for (var row = 0U; row < height; ++row)
      for (var col = 0U; col < width; ++col)
        localThermalFilm[(int)frame].Data[row, col] = System.Math.Log10(localThermalFilm[(int)frame].Data[row, col]);
    frameIndexes[(int)frame] = System.Math.Log10(frameIndexes[(int)frame]);
  });

  // perform polynomial fit for each pixel
  Parallel.For(0U, height, row =>
  {
    for (var col = 0U; col < width; ++col)
    {
      // extract poly fit input y-values for current pixel
      double[] pixelValues = new double[frames];
      for (var frame = 0U; frame < frames; ++frame)
        pixelValues[frame] = localThermalFilm[(int)frame].Data[row, col];

      // (...) value validations omitted here: they filter out the NaN / -Infinity
      // entries produced by the Log10 step, yielding the two lists used below

      // poly fit for current pixel - this is the longest step
      double[] coefficients = Math.PolynomialRegression(frameIndexesValidated.ToArray(), pixelValuesValidated.ToArray(), polyOrder);

      // insert into coefficient images result array
      for (var coefficient = 0U; coefficient < result.GetLength(0); ++coefficient)
        result[coefficient, row, col] = coefficients[coefficient];
    }
  });

  return result;
}

As you can see, several parallelized loops performing several operations on the frames are executed in sequence, with the polynomial fit (Math.PolynomialRegression) being the last and most expensive one. This is a function containing a polynomial fit algorithm I pieced together myself, since no such thing exists in the standard System.Math library, and the only alternative I tried, from the Math.NET library, actually runs slower than the one I wrote. My code is based on the examples given on Rosetta Code: https://rosettacode.org/wiki/Polynomial_regression
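For reference, the Math.NET variant I benchmarked against was along these lines (a minimal sketch, assuming the MathNet.Numerics package; `Fit.Polynomial` returns the coefficients in ascending order of power):

    using MathNet.Numerics;

    // per pixel: fit the log frame indexes (x) against the log pixel values (y)
    double[] coefficients = Fit.Polynomial(
      frameIndexesValidated.ToArray(), pixelValuesValidated.ToArray(), polyOrder);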

My point is that I wrote this entire algorithm before in unmanaged C++. Our company decided to move away from that because of licensing issues with the GUI framework we were using back then, and to use C#/.NET instead. A direct comparison of the old unmanaged C++ code with the managed C# code posted above recently showed me that the C# version takes about 48% (!!!) longer to execute, even though the algorithm is identical. I am aware that C# is a higher-level, managed language and therefore further removed from the machine than C++, so I fully expected it to run slower, but I didn't expect it to be this bad. 48% is quite a big deal, which leads me to believe that I may be doing something wrong. On the other hand, I don't have that much experience yet, so if I'm perfectly honest I don't really know what to expect in a situation like this.

What I have tried so far:

  • switching between running the various individual loops sequentially and parallelized; it is fastest with all of them parallelized, as above
  • adjusting the variables that are accessed by the individual parallelized loop instances (e.g. not accessing the same resolution object each time but declaring separate variables for width and height before starting the loop), which already improved performance quite a bit, but the 48% gap remains after that
  • trying a Parallel.ForEach(Partitioner.Create(0, frames) ... ) approach (see the sketch after this list), i.e. partitioning the chunks of data more coarsely with the Partitioner class; it didn't help, it made the code run slower
  • optimizing other functions that are called, as well as code on the side of this one's caller, as best I can
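
The coarse-partitioning variant mentioned above looked roughly like this (a minimal sketch; the loop body stands in for the actual per-frame work):

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    // each worker receives a contiguous [fromInclusive, toExclusive) range of frames
    Parallel.ForEach(Partitioner.Create(0, frames), range =>
    {
      for (int frame = range.Item1; frame < range.Item2; ++frame)
      {
        // per-frame work (e.g. building the difference image) would go here
      }
    });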

To home in on the question: is it even possible to make C# code like this run with performance comparable to the same code in C++, and if so, how? Or is what I observed perfectly normal, and do I just have to live with it?


EDIT: I added the first three loop bodies in TSR per request; my polynomial regression implementation looks like this:

    public static double[] PolynomialRegression(in double[] xValues, in double[] yValues, byte order)
    {
      Debug.Assert(xValues != null && yValues != null);
      Debug.Assert(xValues.Length == yValues.Length);
      Debug.Assert(xValues.Length != 0 && yValues.Length != 0);

      int dataSamples = xValues.Length;
      double[] result = new double[order + 1];

      // array containing N,sigma(xi),sigma(xi^2),sigma(xi^3)...sigma(xi^2*poly_order), where N=number of samples
      double[] sigmaX = new double[2 * order + 1];
      for (var index = 0U; index < sigmaX.Length; ++index)
      {
        sigmaX[index] = 0.0;
        for (var dataPoint = 0U; dataPoint < dataSamples; ++dataPoint)
          sigmaX[index] += System.Math.Pow(xValues[(int)dataPoint], index);
      }

      // array containing sigma(yi),sigma(xi*yi),sigma(xi^2*yi)...sigma(xi^poly_order*yi)
      double[] sigmaY = new double[order + 1];
      for (var pOrder = 0U; pOrder < sigmaY.Length; ++pOrder)
      {
        sigmaY[pOrder] = 0.0;
        for (var dataPoint = 0U; dataPoint < dataSamples; ++dataPoint)
          sigmaY[pOrder] += System.Math.Pow(xValues[(int)dataPoint], pOrder) * yValues[(int)dataPoint];
      }

      // equation system's augmented normal matrix
      int matrixRows = order + 1;
      int matrixCols = order + 2;
      double[,] matrix = new double[matrixRows, matrixCols];
      for (var row = 0U; row < matrixRows; ++row)
        for (var col = 0U; col < matrixCols - 1; ++col)
          matrix[row, col] = sigmaX[row + col];
      for (var row = 0U; row < matrixRows; ++row)
        matrix[row, order + 1] = sigmaY[row];

      // pivotisation of matrix
      for (var pivotRow = 0U; pivotRow < matrixRows; ++pivotRow)
        for (var lowerRow = pivotRow + 1U; lowerRow < matrixRows; ++lowerRow)
          if (matrix[pivotRow, pivotRow] < matrix[lowerRow, pivotRow])
            for (var col = 0U; col < matrixCols; ++col)
            {
              double temp = matrix[pivotRow, col];
              matrix[pivotRow, col] = matrix[lowerRow, col];
              matrix[lowerRow, col] = temp;
            }

      // Gaussian elimination
      for (var pivotRow = 0U; pivotRow < matrixRows; ++pivotRow)
        for (var lowerRow = pivotRow + 1U; lowerRow < matrixRows; ++lowerRow)
        {
          double ratio = matrix[lowerRow, pivotRow] / matrix[pivotRow, pivotRow];
          for (var col = 0U; col < matrixCols; ++col)
            matrix[lowerRow, col] -= ratio * matrix[pivotRow, col];
        }

      // back-substitution
      for (var row = (short)order; row >= 0; --row)
      {
        result[row] = matrix[row, order + 1];
        for (var col = 0U; col < matrixCols - 1; ++col)
          if (col != row)
            result[row] -= matrix[row, col] * result[col];
        result[row] /= matrix[row, row];
      }

      return result;
    }
flibbo

  • "*code takes about 48% (!!!) longer*" - all things considered that isn't bad. If you can, I'd use `fixed` and pointers everywhere you can; you are going to run into lambda issues, but they are not insurmountable – TheGeneral Feb 07 '20 at 10:15
  • Image manipulation is time consuming. Perhaps you could use a proper library for this sort of thing? Or just call out to some C++ libraries from your C# code. – DavidG Feb 07 '20 at 10:19
  • Ok, I see a few things that could use some improving. 1: Is it absolutely necessary to run your 4 `Parallel.For` in sequence? Couldn't you concat the computations you are doing in (at least) your first 3 loops into one loop? 2: I see a triple(!) nested for loop, making your method at least `O(N^3)` time complexity (which is horrendous). Could you post the code in your first 3 for loops as well so I can check if we can concat them? – MindSwipe Feb 07 '20 at 10:20
  • Btw, the .NET `Parallel.For` is a blocking operation, meaning it will wait until it has enumerated every last element and then continue with the next one. More info here https://stackoverflow.com/q/10153729/9363973 – MindSwipe Feb 07 '20 at 10:25
  • This question should be asked on the [codereview](https://codereview.stackexchange.com/) site rather than here. – devsmn Feb 07 '20 at 10:27
  • @Shawn kinda debatable. There are plenty of questions on SO asking about why something is slow and how to improve it with a working code sample (e.g. [this](https://stackoverflow.com/questions/59962530/the-query-is-executing-very-slowly-is-there-any-way-to-improve-it-any-further) one), but the premise of "Hey I have some working code, how can I improve it" fits much better on the Codereview SE – MindSwipe Feb 07 '20 at 10:32
  • @MindSwipe I edited the question to include the first three loop bodies as well – flibbo Feb 07 '20 at 10:34
  • @Shawn sorry, I'm still new to SO; I'll note it for future reference – flibbo Feb 07 '20 at 10:34
  • BTW: Just mentioning, this is a perfect scenario for Go and its lightweight concurrent Goroutines. – Rob Feb 07 '20 at 10:39
  • "poly fit for current pixel - this is the longest step" - it is also the step we don't have any code for; maybe some SIMD would be appropriate in there? or indeed in the other places? – Marc Gravell Feb 07 '20 at 10:40
  • What are `frameIndexesValidated` and `pixelValuesValidated`? Those may be lazily evaluated and the `ToArray()` would cause execution, which could make that line seem to take longer than it actually would. -- You work with arrays everywhere else, so why do these need to be "ToArray'd"? – Corak Feb 07 '20 at 10:41
  • @Corak these are the results of value validations I did but haven't included in the question for the sake of simplicity. I filter out any values in the pixelValues and frameIndexes lists that ended up NaN or -Inf after taking Log10 as they would break the poly fit. They are lists themselves, hence ToArray(). – flibbo Feb 07 '20 at 10:43
  • @MarcGravell OP linked rosettacode in his question, specifically the rosettacode snippet for [polynomial regression](https://rosettacode.org/wiki/Polynomial_regression#C.23) – MindSwipe Feb 07 '20 at 10:44
  • @MindSwipe - OP also mentioned their solution is "based on" those examples. And also that using Math.Net (which the C# rosettacode does) is "slower". -- So having the actual code instead of extrapolating/guessing might be beneficial. – Corak Feb 07 '20 at 10:48
  • I added the PolynomialRegression code. Now it's really a wall of text though, thanks for taking the time to look at it! – flibbo Feb 07 '20 at 10:51
  • You can write the perf-critical parts in C/C++ and keep the GUI of your choice. You can make C calls from C# and a lot of other languages. – bolov Feb 07 '20 at 11:01
  • @MindSwipe Re: Code Review, this is a broad-enough request that it really *does* belong there rather than here. It's basically asking for a full re-write. – TylerH Feb 07 '20 at 16:39
  • @TylerH is there a way I can move it? – flibbo Feb 07 '20 at 18:35
  • @flibbo You can simply re-ask it there; there's no way to migrate it within the system short of diamond moderators manually initiating the process. – TylerH Feb 07 '20 at 19:03
  • There are other things you can do to make this faster again: try to convert everything to fixed one-dimensional pointer arrays, access them by memory address, and cache arrays outside of loops so you aren't reallocating them all the time; every time you see a `.something`, be wary. In fairness it will be hard and ugly, yet you will see performance increases... – TheGeneral Feb 07 '20 at 22:40

1 Answer


Thank you to everyone who commented. I have tried some of the suggestions: rewriting the PolynomialRegression method using fixed pointers did not have any effect. I also combined some of the loops in the TSR method, so I now have just two parallel loops running in sequence (the first one is definitely needed to find the flash point); this helped, but only a little (something like 1m21s instead of 1m26s).
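For illustration, the fixed-pointer style I tried looked roughly like this (a hypothetical sketch, not my actual rewrite; `SumOfPowers` is an invented helper name, and the code needs to be compiled with unsafe support):

    // pins the managed array so it can be indexed through a raw pointer,
    // avoiding bounds checks inside the loop
    public static unsafe double SumOfPowers(double[] xValues, uint exponent)
    {
      double sum = 0.0;
      fixed (double* x = xValues)
      {
        for (int i = 0; i < xValues.Length; ++i)
          sum += System.Math.Pow(x[i], exponent);
      }
      return sum;
    }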

Then I analyzed the code a little more with the VS CPU profiler. 85% of all CPU cycles within the TSR method were spent in the PolynomialRegression method; as expected, this one does the bulk of the work. What I found inside the polynomial regression, however, surprised me: the System.Math.Pow method is actually a huge bottleneck.

  • the first call, `sigmaX[index] += System.Math.Pow(xValues[(int)dataPoint], index);`, within the first loop owned about 55% of the CPU cycles
  • the second call, in the second loop, `sigmaY[pOrder] += System.Math.Pow(xValues[(int)dataPoint], pOrder) * yValues[(int)dataPoint];`, owned about 26%

The other steps, even the big matrix pivotisation and Gaussian elimination steps, were almost negligible by comparison. I gather this is because System.Math.Pow is a general implementation of exponentiation, including all sorts of checks for negative and fractional exponents. Since in my current problem I only ever have positive integer exponents, I wrote my own specialized method instead:

    // specialized power function for non-negative integer exponents;
    // recurses once per power, so it does 'power' multiplications in total
    public static double UIntPow(double @base, uint power)
    {
      if (power == 0)
        return 1.0;
      else if (power == 1)
        return @base;
      else
        return @base * UIntPow(@base, power - 1);
    }

While this recursive method runs extremely slowly in debug mode (about twice as slow as System.Math.Pow), it is actually very fast in the release build, where the code gets optimized. The execution of TSR now actually runs faster than the C++ equivalent, although I assume I could have gotten the same performance improvement there if I had also used my own UIntPow method.

Again thanks to everyone who took the time to look at my problem and maybe this solution helps someone in the future.


EDIT: Thanks again for the input! This iterative algorithm, using exponentiation by squaring as suggested in the comments, runs even faster than my recursive attempt:

    // exponentiation by squaring: O(log power) multiplications instead of O(power)
    public static double UIntPow(double @base, uint power)
    {
      double result = 1.0;
      while (power != 0)
      {
        // fold the current base into the result whenever the lowest bit is set
        if ((power & 1) == 1)
          result *= @base;
        @base *= @base;
        power >>= 1;
      }
      return result;
    }
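
For completeness, this is how the replacement slots into the two hot loops of PolynomialRegression shown in the question (index and pOrder are already unsigned, so the arguments fit UIntPow directly):

    sigmaX[index] += UIntPow(xValues[(int)dataPoint], index);
    sigmaY[pOrder] += UIntPow(xValues[(int)dataPoint], pOrder) * yValues[(int)dataPoint];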
flibbo
  • Recursion is used for being stylish, not for being speedy. – Theodor Zoulias Feb 07 '20 at 16:23
  • Well done for profiling! I would definitely make your `UIntPow` method iterative rather than recursive -- it's trivial to do, and you don't risk blowing the stack if the optimizer doesn't manage to deal with it. Look at [exponentiation by squaring](https://en.wikipedia.org/wiki/Exponentiation_by_squaring) for a faster way to do this for large powers. Note also that `Math.Pow` is slow in part because it works on `double`s, and floating-point is almost always slower than integer arithmetic. – canton7 Feb 07 '20 at 16:35
  • See also: https://stackoverflow.com/questions/383587/how-do-you-do-integer-exponentiation-in-c and https://stackoverflow.com/a/11419400/1086121 – canton7 Feb 07 '20 at 16:37