
While doing some performance testing, I've run into a situation that I cannot seem to explain.

I have written the following C code:

#include <stdint.h>

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

I use gcc to compile it, along with a test driver, into a single binary. I also use gcc to compile it by itself into a shared object which I call from C# via p/invoke. The intent is to measure the performance overhead of calling native code from C#.

In both C and C#, I create equal-length input arrays of random values and then measure how long multi_arr takes to run. In both cases I use the POSIX clock_gettime() call for timing. The timing calls are positioned immediately before and after the call to multi_arr, so input prep time, etc., does not affect the results. I run 100 iterations and report both the average and the minimum times.

Even though C and C# are executing the exact same function, C# comes out ahead about 50% of the time, usually by a significant amount. For example, for a len of 1,048,576, C#'s min is 768,400 ns vs C's min of 1,344,105 ns. C#'s avg is 1,018,865 ns vs C's 1,852,880 ns. I put some different numbers into this graph (mind the log scales):

[Graph: C vs C# min and avg times across array lengths, log scales]

These results seem extremely wrong to me, but the artifact is consistent across multiple tests. I've checked the asm and IL to verify correctness. Bitness is the same. I have no idea what could be impacting performance to this degree. I've put a minimal reproduction example up here.

These tests were all run on Linux (KDE neon, based off Ubuntu Xenial) with dotnet-core 2.0.0 and gcc 5.0.4.

Has anyone seen this before?

  • Cannot reproduce (using Mono). Both versions run basically in the same time. Compiled with `mcs -optimize+ -unsafe`, mcs 3.2.8.0 on Ubuntu. – Kerrek SB Sep 24 '17 at 21:34
  • @KerrekSB Interesting. I just tried with mono as well and the issue reduced. It still occurred at 131072 (avg 53,380ns vs 128,280ns) and a few others but the numbers for the most part were much closer. – Xcelled Sep 24 '17 at 21:42
  • In my tests C# runs statistically a bit slower. The difference is actually very small. Maybe testing method is not good. – 0___________ Sep 24 '17 at 23:14
  • @PeterJ_01 I posted the code, feel free to make suggestions for how I could improve it. – Xcelled Sep 25 '17 at 02:23
  • A little more research turned up [this](https://stackoverflow.com/questions/8547778/why-are-elementwise-additions-much-faster-in-separate-loops-than-in-a-combined-l?rq=1) which implies that some performance difference is due to how alignment impacts the cache. Maybe C# is allocating the arrays with slightly better alignment? – Xcelled Sep 25 '17 at 02:25
  • Quick tests show that if I change the C code to use a stack-allocated array (instead of malloc) it performs the way I expect, possibly lending credence to the cache theory. – Xcelled Sep 25 '17 at 02:37
  • hate to ask, but did you enable optimization `-O2`/`-O3` for the c code calling itself? also i'd be curious to see the effect of link time optimization `-flto` – Mobius Oct 01 '17 at 04:43
  • I see you posted the code, my bad. – Mobius Oct 01 '17 at 04:56

1 Answer


It is dependent on alignment, as you are already suspecting. malloc returns memory aligned so that storing or retrieving any basic datatype, such as a double or an integer, will not fault, but it makes no promise about how the block of memory lines up with the cache(s).

How much this matters depends on the hardware you test on. Presuming you are talking about x86_64 here, that means the particular Intel or AMD processor and the relative speed of its caches compared to main-memory access.

You can simulate this by testing with various alignments.

Here is an example program that I cobbled together. On my i7 I see large variations from run to run, but the first, most misaligned access is reliably slower than the better-aligned versions.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

uint64_t getnsec()
{
  struct timespec n;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &n);
  return (uint64_t) n.tv_sec * 1000000000 + n.tv_nsec;
}

#define CACHE_SIZE (16 * 1024 * 1024 / sizeof(int32_t))
int main()
{
  int32_t *memory;
  int32_t *unaligned;
  int32_t *x;
  int32_t *y;
  int count;
  uint64_t start, elapsed;
  int32_t len = 1024 * 16;
  int64_t aligned = 1;

  memory = calloc(4 * CACHE_SIZE, sizeof(int32_t));

  // make unaligned as unaligned as possible, e.g. to 0b11111111111111100

  unaligned = (int32_t *) (((intptr_t) memory + CACHE_SIZE) & ~(CACHE_SIZE - 1));
  printf("memory starts at %p, aligned %p\n", memory, unaligned);
  unaligned = (int32_t *) ((intptr_t) unaligned | (CACHE_SIZE - 1));
  printf("memory starts at %p, unaligned %p\n", memory, unaligned);

  for (aligned = 1; aligned < CACHE_SIZE; aligned <<= 1)
  {
    x = (int32_t *) (((intptr_t) unaligned + CACHE_SIZE) & ~(aligned - 1));

    start = getnsec();
    for (count = 1; count < 1000; count++)
    {
      multi_arr(x, x + len, x + len + len, len);
    }
    elapsed = getnsec() - start;
    printf("memory starts at %p, aligned %08"PRIx64" to > cache = %p elapsed=%"PRIu64"\n", unaligned, aligned - 1, x, elapsed);
  }

  exit(0);
}