Profiling my application reveals that 50% of runtime is being spent in a packArrays()
function which performs array transformations where C++ strongly outperforms C#.
In order to improve performance, I used unsafe in packArrays, but gained only low single-digit percentage improvements in runtime. To rule out the cache as the bottleneck and to estimate the ceiling of possible improvement, I wrote packArrays in C++ and timed both languages. The C++ version runs approximately 5x faster than C#, so I decided to give C++/CLI a try.
As a result, I have three implementations:
- C++ - a simple packArrays() function
- C# - packArrays() is wrapped into a class; however, the code inside the function is identical to the C++ version
- C++/CLI - shown below, but again the implementation of packArrays() is identical (literally) to the previous two
The C++/CLI implementation is as follows
QCppCliPackArrays.cpp
public ref class QCppCliPackArrays
{
public:
    void pack(array<bool> ^ xBoolArray, int xLen, array<bool> ^% yBoolArray, int % yLen)
    {
        // prepare variables: pin the managed arrays so the GC cannot move them
        pin_ptr<bool> xBoolArrayPinned = &xBoolArray[0];
        bool * xBoolArray_ = xBoolArrayPinned;
        pin_ptr<bool> yBoolArrayPinned = &yBoolArray[0];
        bool * yBoolArray_ = yBoolArrayPinned;
        // go: copy yLen through a native local, since int % may refer to the GC heap
        int yLenNative = yLen;
        packArrays(xBoolArray_, xLen, yBoolArray_, yLenNative);
        yLen = yLenNative;
    }
};
packArraysWorker.cpp
#pragma managed(push, off)
void packArrays(bool * xArray, int xLen, bool * yArray, int & yLen)
{
    // ... actual code that is identical across languages ...
}
#pragma managed(pop)
QCppCliPackArrays.cpp is compiled with the /clr option; packArraysWorker.cpp is compiled with the No Common Language RunTime Support option.
The problem: When using a C# application to run both the C# and C++/CLI implementations, the C++/CLI implementation is still only marginally faster than C#.
Questions:
- Is there any other option/setting/keyword I can use to increase the performance of C++/CLI?
- Can the performance loss of C++/CLI compared to C++ be wholly attributed to interop? Currently, for 10K repetitions C# runs some 4.5 seconds slower than C++, which would put interop at 0.45 milliseconds per repetition. As all types being passed are blittable, I would expect the interop to ... well, just pass over some pointers.
- Would I gain anything by using P/Invoke? From what I have read, no, but it's always better to ask.
- Is there any other method I can use? Leaving a five-fold increase in performance on the table is just too much.
All timings are made in Release/x64 from the command line (not from VS) on a single thread.
EDIT:
In order to determine the performance loss due to interop, I placed a Stopwatch around the QCppCliPackArrays::packArrays() call, as well as a chrono::high_resolution_clock inside packArrays() itself. The results show that the C# <-> C++/CLI switch costs approx. 5 milliseconds per 10K calls. The switch from managed C++/CLI to unmanaged C++/CLI, according to the results, costs nothing.
Hence, interop can be ruled out as the cause of the performance degradation.
On the other hand, it's obvious that packArrays() is NOT run as unmanaged! But why?
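The inner measurement described in this edit can be sketched roughly as follows. This is only an illustration under assumptions, not the exact code I timed: the name packArraysTimed, the accumulator g_workerNanos, and the trivial loop body are all hypothetical stand-ins, and the sketch uses std::chrono::steady_clock rather than high_resolution_clock because it is guaranteed to be monotonic.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical accumulator for the time spent inside the worker, in nanoseconds.
static std::int64_t g_workerNanos = 0;

// Hypothetical stand-in for the real worker: the body only copies the true
// flags so that the sketch is self-contained.
void packArraysTimed(const bool* xArray, int xLen, bool* yArray, int& yLen)
{
    const auto start = std::chrono::steady_clock::now();

    yLen = 0;
    for (int i = 0; i < xLen; ++i)
        if (xArray[i])
            yArray[yLen++] = true;

    const auto stop = std::chrono::steady_clock::now();
    // Accumulate across calls; a single call is too short to time reliably.
    g_workerNanos +=
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Subtracting the accumulated inner time from the total measured by the outer Stopwatch gives an estimate of the cost of the managed/unmanaged transitions.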
EDIT 2:
I tried to link packArrays() as a .lib file exported from a separate unmanaged C++ library. The results are still the same.
EDIT 3:
The actual packArrays is this:
public void packArrays(bool[] xConditions, int[] xValues, int xLen, ref int[] yValuesPacked, ref int yPackedLen)
{
    // alloc
    yPackedLen = xConditions.trueCount();
    yValuesPacked = new int [yPackedLen];
    // fill
    int xPackedIdx = 0;
    for (int xIdx = 0; xIdx < xLen; xIdx++)
        if (xConditions[xIdx] == true)
            yValuesPacked[xPackedIdx++] = xValues[xIdx];
}
It puts into yValuesPacked all values from xValues for which the corresponding xConditions[i] is true.
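The same packing logic can be sketched in plain C++ as follows. This is an illustrative sketch rather than the exact benchmarked code: the name packValues and the std::vector return type are assumptions made to keep the example self-contained.

```cpp
#include <cstddef>
#include <vector>

// Sketch: return every xValues[i] whose xConditions[i] is true, in order.
std::vector<int> packValues(const bool* xConditions, const int* xValues, std::size_t xLen)
{
    // alloc: count the true conditions first so the output is sized exactly once
    std::size_t trueCount = 0;
    for (std::size_t i = 0; i < xLen; ++i)
        if (xConditions[i])
            ++trueCount;

    std::vector<int> packed;
    packed.reserve(trueCount);

    // fill: copy the selected values, preserving their original order
    for (std::size_t i = 0; i < xLen; ++i)
        if (xConditions[i])
            packed.push_back(xValues[i]);

    return packed;
}
```

Like the C# version, it makes two passes over xConditions: one to size the output and one to fill it.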
Now, I am facing a new issue: I have several implementations aiming to solve this problem, and all of them work correctly (tested). When I run a benchmark that individually calls these different implementations 50K times on arrays 86K items long, I get the following timings in seconds:
The original implementation originalArray is the code listed above. Clearly, the QCsCpp* versions dominate the benchmark; these are the implementations using C++/CLI. However, when I replace originalArray in my original application, which calls packArrays a vast number of times, with either QCsCpp* implementation, the whole application runs SLOWER. I am really clueless about this result, and I must admit that it honestly crushed me. How can this be true? As always, any insight is much appreciated.