
This question is related to another question of mine, titled Calling MASM PROC from C++/CLI in x64 mode yields unexpected performance problems. I didn't receive any comments or answers, but eventually I found out myself that the problem is caused by function thunks that the compiler inserts whenever a managed function calls an unmanaged one, and vice versa. I won't go into the details again, because today I want to focus on another consequence of this thunking mechanism.

To provide some context for the question: my goal was to replace a C++ function for 64-to-128-bit unsigned integer multiplication in an unmanaged C++/CLI class with a function in a MASM64 file, for the sake of performance. The ASM replacement is as simple as can be:

AsmMul1 proc ; ?AsmMul1@@$$FYAX_K0AEA_K1@Z

; rcx  : Factor1
; rdx  : Factor2
; [r8] : ProductL
; [r9] : ProductH

mov  rax, rcx            ; rax = Factor1
mul  rdx                 ; rdx:rax = Factor1 * Factor2
mov  qword ptr [r8], rax ; [r8] = ProductL
mov  qword ptr [r9], rdx ; [r9] = ProductH
ret

AsmMul1 endp

I expected a big performance boost by replacing a compiled function that performs four 32-to-64-bit multiplications with a single CPU MUL instruction. The big surprise was that the ASM version was about four times slower (!) than the C++ version. After a lot of research and testing, I found out that some function calls in C++/CLI involve thunking, which is obviously so expensive that it takes much more time than the thunked function itself.
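For context, a portable C++ routine of the kind being replaced can be sketched as follows (a minimal sketch; `CppMul64` and the exact carry handling are illustrative, not the original project's code):

```cpp
#include <cstdint>

// 64x64 -> 128-bit unsigned multiply built from four 32x32 -> 64-bit
// partial products -- the kind of compiled code the MUL instruction replaces.
void CppMul64(uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi)
{
    const uint64_t aL = a & 0xFFFFFFFFu, aH = a >> 32;
    const uint64_t bL = b & 0xFFFFFFFFu, bH = b >> 32;

    const uint64_t ll = aL * bL;   // partial product 1
    const uint64_t lh = aL * bH;   // partial product 2
    const uint64_t hl = aH * bL;   // partial product 3
    const uint64_t hh = aH * bH;   // partial product 4

    // Combine the partial products with carry propagation.
    const uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFu) + (hl & 0xFFFFFFFFu);
    lo = (mid << 32) | (ll & 0xFFFFFFFFu);
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}
```

All of this shifting and adding is exactly what a single `MUL` does in hardware, which is why the thunking overhead dominating it was so surprising.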

After reading more about this thunking, it turned out that whenever you are using the compiler option /clr, the calling convention of all functions is silently changed to __clrcall, which means that they become managed functions. Exceptions are functions that use compiler intrinsics, inline ASM, and calls to other DLLs via dllimport - and as my tests revealed, this seems to include functions that call external ASM functions.

As long as all interacting functions use the __clrcall convention (i.e. are managed), no thunking is involved, and everything runs smoothly. As soon as the managed/unmanaged boundary is crossed in either direction, thunking kicks in, and performance is seriously degraded.
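A timing harness of the kind used for such comparisons can be sketched like this (illustrative only; `Mul64` is a stand-in for whichever implementation is under test, and the odd constant is an arbitrary test multiplier):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Stand-in for the routine under test; in the real measurement this would
// call either AsmMul1 or the compiled C++ version across the /clr boundary.
static void Mul64(uint64_t a, uint64_t b, uint64_t& lo, uint64_t& hi)
{
    lo = a * b;
    hi = 0;
}

// Returns the average time per call in nanoseconds.
double TimeCalls(std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    uint64_t lo = 0, hi = 0, sink = 0;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        Mul64(i | 1, 0x9E3779B97F4A7C15ull, lo, hi);
        sink ^= lo ^ hi;   // keep results live so the loop isn't optimized away
    }
    const auto stop = clock::now();
    std::printf("checksum: %llu\n", (unsigned long long)sink);
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / (double)iterations;
}
```

With a harness like this, the per-call cost of the thunked ASM version versus the all-managed C++ version becomes directly comparable.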

Now, after this long prologue, let's get to the core of my question. As far as I understand the __clrcall convention, and the /clr compiler switch, marking a function in an unmanaged C++ class this way causes the compiler to emit MSIL code. I've found this sentence in the documentation of __clrcall:

When marking a function as __clrcall, you indicate the function implementation must be MSIL and that the native entry point function will not be generated.

Frankly, this is scaring me! After all, I'm going through the hassle of writing C++/CLI code in order to get real native code, i.e. super-fast x64 machine code. However, this doesn't seem to be the default for mixed assemblies. Please correct me if I'm getting this wrong: if I'm using the project defaults given by VC2017, my assembly contains MSIL, which will be JIT-compiled. True?

There is a #pragma managed that seems to inhibit the generation of MSIL in favor of native code on a per-function basis. I've tested it, and it works, but then the problem is that thunking gets in the way again as soon as the native code calls a managed function, and vice versa. In my C++/CLI project, I found no way to configure the thunking and code generation without getting a performance hit at some place.
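For illustration, the per-function switch looks like this (a sketch with illustrative names; under MSVC with /clr, everything between push and pop compiles to native x64 instead of MSIL, and other compilers simply ignore the pragma):

```cpp
#include <cstddef>
#include <cstdint>

#pragma managed(push, off)   // code below is emitted as native x64, not MSIL

// Hypothetical performance-critical routine; name and body are illustrative.
void NativeHotLoop(const uint64_t* src, uint64_t* dst, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 3u;   // placeholder for the real hot-path work
}

#pragma managed(pop)   // subsequent functions are managed (__clrcall) again
```

The catch described above remains: every call from managed code into `NativeHotLoop` (and back) crosses the boundary and pays for a thunk.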

So what I'm asking myself now: what's the point of using C++/CLI in the first place? Does it give me any performance advantage when everything is still compiled to MSIL? Maybe it's better to write everything in pure C++ and use P/Invoke to call those functions? I don't know, I'm kind of stuck here.

Maybe someone can shed some light on this terribly poorly documented topic...

  • I think this link might help you: https://stackoverflow.com/a/5698663/11241587 – Theodor Badea Mar 22 '19 at 09:39
  • It's been a while since I used MASM, and never with managed code. The first thing is that the compiler does optimization, and your manual assembly code may never be as good as the optimizer's output. What managed code does is surround every call so that stack misalignment cannot crash the machine. It also adds memory segments, so accessing memory outside the memory assigned to the program causes an exception, and it can recover from such exceptions without crashing the machine. Once you set up C++ code that makes such calls, I do not think it is protected the same as managed C# code unless you add the directives. – jdweng Mar 22 '19 at 09:45
  • @TheodorBadea Thanks for the link. I think myself that the only possible way out is to put all code that needs performance into a couple of source files that are compiled without /clr, in order to minimize managed/unmanaged transitions. However, this somewhat wrecks the elegant idea of C++/CLI. You then have to take care of calls from managed classes into this code. For instance, I have managed properties in mind, which call down to getter/setter functions in the native classes. So the performance problem might just be shifted to another location in the project. – SBS Mar 22 '19 at 10:17
  • "I expected a big performance boost by replacing a compiled function with four 32-to-64-bit multiplications with a simple CPU `MUL` instruction." Actually, you replaced a C++ function with an ASM **function**, not an ASM **instruction**. C++ functions can be inlined by C++ compilers, eliminating **all** function call overhead. Your `AsmMul1` must be called, with the usual overhead. – MSalters Mar 22 '19 at 12:59
  • @MSalters The call/return overhead of an x64 fastcall to this four-instruction routine, with parameters passed in registers, is very small compared to the total effort done inside the C++ function, which involves four IMULs and lots of adds and shifts. Inlining can hardly outweigh that here. – SBS Mar 22 '19 at 14:03

0 Answers