I have a matrix struct in C# with the multiplication operations implemented without using SSE intrinsics. As I don't have access to the code at this very moment, I'll try to specify as many details as I can rather than copy/pasting the definition. I can edit the post in the morning to include relevant definitions if need be.
The struct has 16 floats defined as M11, M12, M13, ..., M43, M44, with the sequential layout specified: [StructLayout(LayoutKind.Sequential)]
The C++ function is imported on the C# side with the attribute specification
[DllImport("cppCode.dll", EntryPoint = "MatrixMultiply", CallingConvention = CallingConvention.Cdecl)]
I'm trying to call a C++ function using P/Invoke to optimize the multiplications. My question is about passing the parameters. As mentioned on MSDN, each P/Invoke call costs roughly 10 to 30 x86 instructions, plus marshalling overhead if the types passed are not blittable.
The function call in C# looks like
MatrixMultiply(ref matrix1, ref matrix2, out matrix_out);
and the C++ counterpart receives them as mat*, with mat being the matching C++ struct made of four vec4s.
extern "C" __declspec(dllexport) void MatrixMultiply(mat* m1, mat* m2, mat* out) { *out = *m1 * *m2; }
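For reference, the native side described above could look roughly like the following. This is only a sketch under assumptions, not the actual code: it assumes `mat` is 16 contiguous floats in row-major order (matching M11..M44), the `MAT_API` macro is a hypothetical helper for exporting an unmangled symbol, and the inputs are const-qualified for clarity. It uses unaligned loads because a struct passed by `ref` from C# is not guaranteed to be 16-byte aligned, and falls back to scalar code where SSE is unavailable.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define MAT_USE_SSE 1
#endif

// Hypothetical export macro: unmangled name so EntryPoint = "MatrixMultiply" resolves.
#if defined(_WIN32)
#define MAT_API extern "C" __declspec(dllexport)
#else
#define MAT_API extern "C"
#endif

// Assumed native mirror of the C# struct: 16 floats, row-major (M11, M12, ..., M44).
struct mat {
    float m[16];
};

// Computes out = a * b. The result is built in a temporary, so out may alias a or b.
MAT_API void MatrixMultiply(const mat* a, const mat* b, mat* out) {
    mat r;
#if defined(MAT_USE_SSE)
    // Load the four rows of b once; unaligned loads are safe for any address.
    const __m128 b0 = _mm_loadu_ps(b->m + 0);
    const __m128 b1 = _mm_loadu_ps(b->m + 4);
    const __m128 b2 = _mm_loadu_ps(b->m + 8);
    const __m128 b3 = _mm_loadu_ps(b->m + 12);
    for (int i = 0; i < 4; ++i) {
        const float* row = a->m + 4 * i;
        // Row i of the result is a linear combination of the rows of b.
        __m128 acc = _mm_mul_ps(_mm_set1_ps(row[0]), b0);
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[1]), b1));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[2]), b2));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(row[3]), b3));
        _mm_storeu_ps(r.m + 4 * i, acc);
    }
#else
    // Scalar fallback for targets without SSE.
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a->m[4 * i + k] * b->m[4 * k + j];
            }
            r.m[4 * i + j] = sum;
        }
    }
#endif
    *out = r;
}
```

With this layout the C# struct is blittable, so `ref`/`out` parameters should arrive as plain pointers to the managed data rather than as marshalled copies.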
When the calculations are profiled, the gain is quite minimal in the average case, a microsecond or two. However, the worst case became worse, going from 150 µs with the C# multiplication to 400 µs with the C++ multiplication, which leads me to think that the overhead of calling a function exported from the DLL almost cancels out the gain from the SSE instructions.
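One way to check whether per-call overhead is what dominates would be to amortize it: export a function that multiplies many matrix pairs in a single native call and compare the per-matrix cost. The sketch below illustrates that idea only; `MatrixMultiplyBatch`, `multiply_one`, and the `MAT_API` macro are hypothetical names, `mat` is assumed to be 16 row-major floats, and the scalar kernel is a stand-in where the SSE version would go.

```cpp
#include <cstddef>

// Hypothetical export macro for an unmangled, DLL-visible symbol.
#if defined(_WIN32)
#define MAT_API extern "C" __declspec(dllexport)
#else
#define MAT_API extern "C"
#endif

// Assumed layout: 16 floats, row-major, matching the sequential C# struct.
struct mat {
    float m[16];
};

// Stand-in scalar kernel; an SSE kernel could be substituted here.
static void multiply_one(const mat& a, const mat& b, mat& out) {
    mat r;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += a.m[4 * i + k] * b.m[4 * k + j];
            }
            r.m[4 * i + j] = sum;
        }
    }
    out = r;  // written last so out may alias a or b
}

// Multiplies count pairs: out[i] = lhs[i] * rhs[i].
// A single managed-to-native transition covers the whole batch.
MAT_API void MatrixMultiplyBatch(const mat* lhs, const mat* rhs, mat* out, int count) {
    for (int i = 0; i < count; ++i) {
        multiply_one(lhs[i], rhs[i], out[i]);
    }
}
```

If the batched version shows a clear per-matrix speedup while the one-call-per-multiply version does not, that would support the overhead explanation; if neither does, the cause is probably elsewhere.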
As I have limited familiarity with C#, I can't tell for sure what's going on. Am I doing something wrong? Is there a faster approach for C#/C++ communication in this particular case?