This question is related to another question of mine, titled "Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems". I didn't receive any comments or answers, but eventually I found out myself that the problem is caused by function thunks that the compiler inserts whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details again, because today I want to focus on another consequence of this thunking mechanism.
To provide some context for the question, my problem was the replacement of a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class by a function in an MASM64 file for the sake of performance. The ASM replacement is as simple as can be:
AsmMul1 proc ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z
; rcx : Factor1
; rdx : Factor2
; [r8] : ProductL
; [r9] : ProductH
mov rax, rcx ; rax = Factor1
mul rdx ; rdx:rax = Factor1 * Factor2
mov qword ptr [r8], rax ; [r8] = ProductL
mov qword ptr [r9], rdx ; [r9] = ProductH
ret
AsmMul1 endp
I expected a big performance boost from replacing a compiled function that performs four 32-to-64-bit multiplications with a single CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which is evidently so expensive that it takes much more time than the thunked function itself.
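For reference, the kind of C++ function being replaced can be sketched roughly as follows. This is not my original code, just a minimal portable illustration of the four-multiplication approach; the name CppMul1 and the parameter names are assumptions chosen to mirror the ASM version.

```cpp
#include <cstdint>

// Hypothetical C++ counterpart of AsmMul1 (name and signature are assumptions):
// 64 x 64 -> 128-bit unsigned multiplication built from four 32 x 32 -> 64-bit products.
void CppMul1(std::uint64_t factor1, std::uint64_t factor2,
             std::uint64_t& productL, std::uint64_t& productH)
{
    const std::uint64_t mask = 0xFFFFFFFFull;

    // Split both factors into 32-bit halves.
    const std::uint64_t a0 = factor1 & mask, a1 = factor1 >> 32;
    const std::uint64_t b0 = factor2 & mask, b1 = factor2 >> 32;

    // The four partial products; each one fits in 64 bits.
    const std::uint64_t p00 = a0 * b0;
    const std::uint64_t p01 = a0 * b1;
    const std::uint64_t p10 = a1 * b0;
    const std::uint64_t p11 = a1 * b1;

    // Middle column: upper half of p00 plus the lower halves of the cross
    // products. The sum is below 2^34, so it cannot overflow 64 bits.
    const std::uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);

    productL = (mid << 32) | (p00 & mask);
    productH = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Compared to this, the ASM version does the whole job with a single MUL, which is why I expected it to win easily.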
After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.
As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:
When marking a function as __clrcall, you indicate the function implementation must be MSIL and that the native entry point function will not be generated.
Frankly, this scares me! After all, I'm going through the hassle of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting it wrong: if I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?
There is a #pragma managed (in its off form, or equivalently #pragma unmanaged) that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit somewhere.
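To illustrate what I mean by switching code generation per function, here is a minimal sketch. The function name NativeMul is hypothetical, and the non-MSVC branch exists only so that the snippet also builds with compilers that don't know the pragma or the _umul128 intrinsic; it assumes a 64-bit target with unsigned __int128 support.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>
// Functions defined from here on are compiled to native code, even under /clr.
#pragma managed(push, off)
#endif

// Hypothetical native helper: 64 x 64 -> 128-bit unsigned multiplication.
void NativeMul(std::uint64_t factor1, std::uint64_t factor2,
               std::uint64_t& productL, std::uint64_t& productH)
{
#if defined(_MSC_VER)
    // x64 intrinsic: returns the low 64 bits, stores the high 64 bits.
    productL = _umul128(factor1, factor2, &productH);
#else
    // Fallback for GCC/Clang, assuming unsigned __int128 is available.
    const unsigned __int128 product =
        static_cast<unsigned __int128>(factor1) * factor2;
    productL = static_cast<std::uint64_t>(product);
    productH = static_cast<std::uint64_t>(product >> 64);
#endif
}

#if defined(_MSC_VER)
// Restore the previous managed/unmanaged setting for the code that follows.
#pragma managed(pop)
#endif
```

The function itself is then native, but every call to it from a managed (__clrcall) function still crosses the managed/unmanaged boundary, so the thunking cost described above applies to each call.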
So what I'm asking myself now is: what's the point of using C++/CLI in the first place? Does it give me any performance advantage when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use P/Invoke to call those functions? I don't know, I'm kind of stuck here.
Maybe someone can shed some light on this terribly poorly documented topic...