
MSDN states:

Regardless of the interop technique used, special transition sequences, called thunks, are required each time a managed function calls an unmanaged function and vice versa. These thunks are inserted automatically by the Visual C++ compiler, but it is important to keep in mind that cumulatively, these transitions can be expensive in terms of performance.

However, surely the CLR calls C++ and Win32 functions all the time. To deal with files, the network, windows, and nearly anything else, unmanaged code must be called. How does it escape the thunking penalty?

Here's an experiment written in C++/CLI that may help to describe my issue:

#include <cmath>

using namespace System;
using namespace System::Diagnostics;

#define REPS 10000000

#pragma unmanaged
void go1() {
    for (int i = 0; i < REPS; i++)
        pow(i, 3);
}
#pragma managed
void go2() {
    for (int i = 0; i < REPS; i++)
        pow(i, 3);
}
void go3() {
    for (int i = 0; i < REPS; i++)
        Math::Pow(i, 3);
}

public ref class C1 {
public:
    static void Go() {
        auto sw = Stopwatch::StartNew();
        go1();
        Console::WriteLine(sw->ElapsedMilliseconds);
        sw->Restart();
        go2();
        Console::WriteLine(sw->ElapsedMilliseconds);
        sw->Restart();
        go3();
        Console::WriteLine(sw->ElapsedMilliseconds);
    }
};

//Go is called from a C# app

The results are (consistently):

405 (go1 - pure C++)
818 (go2 - managed code calling C++)
289 (go3 - pure managed)

Why go3 is faster than go1 is a bit of a mystery, but that's not my question. My question is this: go1 and go2 show that the thunking penalty adds about 400 ms. How does go3 escape this penalty, since it calls C++ to do the actual calculation?

Even if this experiment is for some reason invalid, my question remains: does the CLR really pay a thunking penalty every time it calls C++/Win32?

wezten
  • The experiment is really invalid. All time measurements should be done in the Release configuration, and in Release the functions go1, go2, and go3 may be removed entirely since they have no side effects. – Alex F Aug 12 '18 at 08:52
  • @AlexF You're right. I redid it in Release mode, summed and output the totals so the loops wouldn't be optimized away, and the results were `269,306,330`. So I guess Math.Pow does suffer from thunking after all. Please write this as an answer. – wezten Aug 12 '18 at 09:01
  • 1
    The whole purpose of Managed Code is to prevent a Blue Screen. So to make sure you do not get a Blue screen each function call must be protected by an exception handler or the code must be thoroughly verified that an exception will not occur. So the thunks are wrappers in the execution stack that adds exception handlers to catch the microprocessors exceptions. – jdweng Aug 12 '18 at 09:56

1 Answer


Benchmarking is a black art, and you got some misleading results here. Running the Release build is very important; if you do that correctly, you'll notice that go1() no longer takes any time at all. The native code optimizer has special knowledge of pow(): if you don't use its result, it eliminates the call entirely.

You have to change the code to get reliable results. First, put a loop around the Go() test body and repeat it at least 20 times; this gets rid of jitting and caching overhead and makes the large standard deviation visible. Knock a 0 off REPS so you don't have to wait too long. Prefer Tools > Options > Debugging > General with "Suppress JIT optimization" unticked. Change the code; I recommend:

__declspec(noinline)
double go1() {
    double sum = 0;
    for (int i = 0; i < REPS; i++)
        sum += pow(i, 3);
    return sum;
}

Note how the sum variable forces the optimizer to keep the pow() call, while __declspec(noinline) prevents the entire function from being deleted and keeps its code from contaminating the Go() body. Do the same for go2 and go3, using [MethodImpl(MethodImplOptions::NoInlining)].

Results I see on my laptop: x64: 75, 84, 84, x86: 73, 89, 89 +5/-3 msec.

Three different mechanisms at work:

  • go1() code generation is as you'd expect in native code, a direct call to the __libm_sse2_pow_precise() CRT function in x64 mode. Nothing remarkable here, other than the risk of getting it deleted in the Release build.
  • go2() uses the thunk that you asked about. The docs are a bit too panicky about thunking; all that is required is for the code to write a cookie on the stack that prevents the garbage collector from blundering into unmanaged stack frames when it looks for object roots. The thunk can be more expensive when it also has to convert the function arguments and/or return value, but that is not the case here. The jitter optimizer cannot eliminate the pow() call; it has no special knowledge of CRT functions.
  • go3() uses a very different mechanism, in spite of the similar measurement. Math::Pow() is special-cased in the CLR; it uses the so-called FCall mechanism. No thunking: the call goes straight from managed code to compiled C++ machine code. These kinds of micro-optimizations are pretty common in the CLR/BCL. They are somewhat necessary, since there is extra overhead: the method performs checks on its arguments that can throw an exception. That is also the basic reason the jitter optimizer did not eliminate the call; it generally avoids optimizations that make exceptions disappear.
Hans Passant
  • Thank you, I was confused about thunking, and this clears it all up. – wezten Aug 12 '18 at 12:54
  • PInvoke overhead (as in `go2()`) is also a bit tricky to measure, as there is a per-calling method part and a per-callsite part. See [dotnet/coreclr#2373](https://github.com/dotnet/coreclr/issues/2373) for some discussion. – Andy Ayers Aug 12 '18 at 18:53
  • Do beware that the linked bug report is not relevant to this question. go2() does not use the pinvoke marshaller and [DllImport] is not at play. This is C++ interop at work, the thunk is specific to C++/CLI code and has no equivalent in other languages. – Hans Passant Aug 16 '18 at 09:23