
Profiling my application reveals that 50% of runtime is being spent in a packArrays() function which performs array transformations where C++ strongly outperforms C#.

To improve performance, I used unsafe code in packArrays, which gained only a low single-digit percentage improvement in runtime. To eliminate the cache as the bottleneck and to estimate the ceiling of the possible improvement, I wrote packArrays in C++ and timed both languages. The C++ version runs approx. 5x faster than the C# one, so I decided to give C++/CLI a try.

As a result, I have three implementations:

  1. C++ - a simple packArrays() function
  2. C# - packArrays() is wrapped into a class, however the code inside the function is identical to the C++ version
  3. C++/CLI - shown below, but again the implementation of packArrays() is identical (literally) to the previous two

The C++/CLI implementation is as follows

QCppCliPackArrays.cpp

public ref class QCppCliPackArrays
{
public:
    void pack(array<bool> ^ xBoolArray, int xLen, array<bool> ^% yBoolArray, int % yLen)
    {
        // pin the managed arrays so the GC cannot move them during the native call
        pin_ptr<bool> xBoolArrayPinned = &xBoolArray[0];
        bool * xBoolArray_ = xBoolArrayPinned;

        pin_ptr<bool> yBoolArrayPinned = &yBoolArray[0];
        bool * yBoolArray_ = yBoolArrayPinned;

        // a tracking reference (int %) cannot bind to a native int &, so go through a local
        int yLenNative = yLen;
        packArrays(xBoolArray_, xLen, yBoolArray_, yLenNative);
        yLen = yLenNative;
    }
};

packArraysWorker.cpp

#pragma managed(push, off)
    void packArrays(bool * xArray, int xLen, bool * yArray, int & yLen)
    {
        ... actual code that is identical across languages code ...
    }
#pragma managed(pop)

QCppCliPackArrays.cpp is compiled with the /clr option; packArraysWorker.cpp is compiled with the No Common Language Runtime Support option.

The problem: when a C# application runs both the C# and the C++/CLI implementations, the C++/CLI implementation is still only marginally faster than C#.

Questions:

  1. Is there any other option/setting/keyword I can use to increase the performance of C++/CLI?
  2. Can the performance loss of C++/CLI compared to C++ be wholly attributed to interop? Currently, for 10K repetitions C# runs some 4.5 seconds slower than C++, which would put interop at 0.45 milliseconds per repetition. As all types being passed are blittable, I would expect the interop to, well, just pass over some pointers.
  3. Would I gain anything by using P/Invoke? From what I read not, but it's always better to ask.
  4. Is there any other method I can use? Leaving a five-fold increase in performance on the table is just too much.

All timings are made in Release/x64 from the command line (not from VS) on a single thread.

EDIT:

To determine the performance loss due to interop, I placed a Stopwatch around the QCppCliPackArrays::packArrays() call as well as a chrono::high_resolution_clock inside packArrays() itself. The results show that the C# <-> C++/CLI switch costs approx. 5 milliseconds per 10K calls. The switch from managed C++/CLI to unmanaged C++/CLI, according to the results, costs nothing.

Hence, interop can be ruled out as the cause of the performance degradation.
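
The per-call figure behind that conclusion is simple arithmetic: 5 milliseconds over 10K calls is about 500 nanoseconds per call. A minimal C++ sketch of how such a per-call average could be taken with chrono::high_resolution_clock follows. The helper name averageNanosPerCall is hypothetical and is not part of the code being benchmarked.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: average wall-clock time per call, in nanoseconds.
// The whole loop is timed once, so clock overhead is paid once rather than per call.
template <typename F>
double averageNanosPerCall(F&& work, std::size_t repetitions)
{
    using clock = std::chrono::high_resolution_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < repetitions; ++i)
        work();
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(repetitions);
}
```

Timing the loop as a whole, rather than each call, keeps the measurement itself from dominating very short calls.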

On the other hand, it's obvious that packArrays() is NOT run as unmanaged! But why?

EDIT 2: I tried to link the packArrays() as a .lib file exported from a separate unmanaged C++ library. Results are still the same.

EDIT 3: The actual packArrays is this

public void packArrays(bool[] xConditions, int[] xValues, int xLen, ref int[] yValuesPacked, ref int yPackedLen)
{
    // alloc
    yPackedLen = xConditions.trueCount();
    yValuesPacked = new int [yPackedLen];

    // fill
    int xPackedIdx  = 0;
    for (int xIdx = 0; xIdx < xLen; xIdx++)
        if (xConditions[xIdx] == true)
            yValuesPacked[xPackedIdx++] = xValues[xIdx];
}

It puts into yValuesPacked all values from xValues whose corresponding xConditions[i] is true.
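
For comparison, a native C++ version of the same two-pass logic might look like the sketch below. It is an assumption, not the worker code elided earlier: the name packArraysNative and the std::vector return type are illustrative, and the first loop stands in for trueCount().

```cpp
#include <cstddef>
#include <vector>

// Hypothetical native counterpart of the C# packArrays listed above.
std::vector<int> packArraysNative(const bool* xConditions, const int* xValues, std::size_t xLen)
{
    // Pass 1: count the true flags (the role trueCount() plays in the C# code).
    std::size_t packedLen = 0;
    for (std::size_t i = 0; i < xLen; ++i)
        if (xConditions[i])
            ++packedLen;

    // Pass 2: copy the selected values, preserving their order.
    std::vector<int> packed(packedLen);
    std::size_t packedIdx = 0;
    for (std::size_t i = 0; i < xLen; ++i)
        if (xConditions[i])
            packed[packedIdx++] = xValues[i];
    return packed;
}
```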

Now, I am facing a new issue - I have several implementations aiming to solve this problem, and all of them work correctly (tested). When I run a benchmark that individually calls these different implementations 50K times on arrays 86K items long, I get the following timings in seconds:

[Image: benchmark timings]

The original implementation, originalArray, is the code listed above. Clearly, the QCsCpp* versions dominate the benchmark - these are the implementations using C++/CLI. However, when I replace originalArray in my original application, which calls packArrays a vast number of times, with either QCsCpp* implementation, the whole application runs SLOWER. With this result, I am really clueless and I must admit that it honestly crushed me. How can this be true? As always, any insight is much appreciated.

Daniel Bencik
  • what "array Type" do you use in C#? List for example has a capacity and a count. https://msdn.microsoft.com/en-us/library/y52x03h2(v=vs.110).aspx – Mat Aug 10 '16 at 10:21
  • @Mat: only bool [] for performance reasons. – Daniel Bencik Aug 10 '16 at 10:22
  • @Mat: yes, currently having 84K items. – Daniel Bencik Aug 10 '16 at 10:27
  • Your profile results can't be accurate. There is nothing special about C# calling a C++/CLI function, 500 nanoseconds is entirely too long. Some odds that your measurement included the cost of jitting the function, a one-time cost you have to pay, avoided by using Ngen. And there really *is* a cost to calling native code, beyond constructing the stack frame you always pay for the transition, a "cookie" has to be written to the stack that prevents the GC from blundering into native stack frames. Benchmarks don't repeat well, some odds you are measuring the cost of accessing memory. – Hans Passant Aug 10 '16 at 13:35
  • @Hans Passant: You are right. I wrote another benchmark with different results. The error in my original benchmark however remains a mystery. The length of operations outside the Stopwatch.Start()/Stopwatch.Stop() commands heavily influences Stopwatch.Elapsed ... – Daniel Bencik Aug 10 '16 at 14:53
  • Are you using .NET 4.5? – lsalamon Aug 10 '16 at 15:41
  • @Isalomon: .NET 4.0 – Daniel Bencik Aug 10 '16 at 16:10
  • You should post the actual code in `packArrays`, because right now any *"answer"* would have to be a guess. See [mcve]. – Lucas Trzesniewski Aug 12 '16 at 09:30
  • @LucasTrzesniewski: I have made an edit. Thank you. – Daniel Bencik Aug 15 '16 at 16:20
  • Did you use the exact same data in all your tests? Your function is sensitive to [branch mispredictions](http://stackoverflow.com/q/11227809/3764814). Also, managed arrays have bounds checking built-in, and that may slow the code down too. – Lucas Trzesniewski Aug 15 '16 at 18:42
  • @LucasTrzesniewski: Yes, I used exactly the same data in both the benchmark and the real application. About branching, I know, but that's unavoidable. I checked the effect of boundary checking in managed arrays and the effect is relatively small, low single-digit percentage points. – Daniel Bencik Aug 15 '16 at 18:52
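
On the branching point raised in the comments: the conditional store can be written without a data-dependent branch by always storing and advancing the output index by the condition. The sketch below is a hypothetical variant (packArraysBranchless is not from the question). It needs an output buffer as long as the input, and whether it is faster depends on the data and the compiler and was not measured here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical branch-free variant of the pack loop (illustrative only).
std::vector<int> packArraysBranchless(const bool* xConditions, const int* xValues, std::size_t xLen)
{
    // Worst case every flag is true, so reserve room for xLen values.
    std::vector<int> packed(xLen);
    std::size_t packedIdx = 0;
    for (std::size_t i = 0; i < xLen; ++i) {
        // Always store; the slot is overwritten by a later store unless the flag was true.
        packed[packedIdx] = xValues[i];
        // bool converts to 0 or 1, so the index advances only for true flags.
        packedIdx += static_cast<std::size_t>(xConditions[i]);
    }
    // Drop the unused tail so the result holds only the selected values.
    packed.resize(packedIdx);
    return packed;
}
```

This variant also folds the counting pass into the fill loop, at the price of a larger temporary allocation.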
