
I benchmarked malloc vs. new for allocating an array of floats. My understanding was that the operations performed by malloc are a subset of the operations performed by new -- that malloc just allocates but new allocates and constructs, although I'm not sure if that's a meaningful difference for primitives.
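
For what it's worth, my mental model of that difference looks roughly like the sketch below. This is illustration only, not the benchmarked code; for a primitive type, default-initialization by new[] does nothing, so only the value-initialized form should cost anything extra.

#include <cstddef>
#include <cstdlib>

int main() {
    const std::size_t size = 4000;

    // malloc: raw, uninitialized storage; no constructors are involved.
    float* a = static_cast<float*>(std::malloc(sizeof(float) * size));

    // new[] of a primitive: default-initialization of a float does nothing,
    // so this should also be "just allocate".
    float* b = new float[size];

    // Only this form asks for value-initialization (zeroing) of every element.
    float* c = new float[size]();

    delete[] c;
    delete[] b;
    std::free(a);
    return 0;
}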

Benchmarking results with gcc show the expected behavior: malloc() is faster. There are even SO questions asking the opposite of this one.

With icc, malloc can be 7x slower than new. How is that possible?!

Everything that follows is just details of the benchmarking procedure.

For benchmarking, I used a protocol recently described by Intel. Here are my results.

Clock cycles elapsed while allocating an array of 4000 floats with GNU's gcc:

new memory allocation, cycles            12168
malloc allocation, cycles                 5144

And with Intel's icc:

new    memory allocation clock cycles     7251
malloc memory allocation clock cycles    52372

How I'm using malloc:

volatile float* numbers = (float*)malloc(sizeof(float)*size);

How I'm using new:

volatile float* numbers = new float[size];
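
For completeness, the matching cleanup for the two forms is sketched below (not reproduced verbatim from my programs); the cast is only there to strip the volatile qualifier so the pointer converts to the void* that free() expects.

free(const_cast<float*>(numbers));      // pairs with the malloc version
delete[] const_cast<float*>(numbers);   // pairs with the new[] version (use one or the other)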

The volatile is there because in previous benchmarking attempts I've had issues with wily compilers optimizing away entire function calls and generating programs that just store constants. (The function implementations that the compiler chose to optimize in this way were indeed faster than the ones that it didn't!) I tried with the volatile removed just to be sure and the results were the same.

I sandwich the portion of code I want to benchmark between two macros.

Macro that comes before a function:

#define CYCLE_COUNT_START \
asm volatile ("CPUID\n\t" \
"RDTSC\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: \
"%rax", "%rbx", "%rcx", "%rdx");

Macro that comes after a function:

#define CYCLE_COUNT_END \
asm volatile("RDTSCP\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t" \
"CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: \
"%rax", "%rbx", "%rcx", "%rdx"); \
start = ( ((uint64_t)cycles_high << 32) | cycles_low ); \
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 ); \
ellapsed_cycles = end - start;
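
The macros assume a handful of variables are already in scope. I haven't reproduced my exact declarations here, but following the Intel paper they look roughly like this:

#include <cstdint>

// Variables the timing macros above expect to find in scope
// (names taken straight from the macros; types follow the Intel whitepaper).
unsigned cycles_low, cycles_high;      // RDTSC result, split across EAX/EDX
unsigned cycles_low1, cycles_high1;    // RDTSCP result, split across EAX/EDX
uint64_t start, end, ellapsed_cycles;  // assembled 64-bit counts and their difference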

So the allocation call with the sandwiching macros for new is this:

CYCLE_COUNT_START
volatile float* numbers = new float[size];
CYCLE_COUNT_END

After that I check the value of ellapsed_cycles to see how everything went.
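
Each program then prints the result as a single line (the exact wording varies between the programs, as in the outputs quoted above). A hypothetical report() helper, just to show the shape of that line:

#include <cstdint>
#include <cstdio>

// Hypothetical helper for illustration only -- each real program prints one
// line ending in the cycle count, which the shell loop in EDIT 1 below
// slices apart with cut.
void report(uint64_t ellapsed_cycles) {
    std::printf("new memory allocation, cycles %10llu\n",
                static_cast<unsigned long long>(ellapsed_cycles));
}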

And just to be sure I'm not doing something silly, here's how I'm compiling with icc:

icc -O3 -ipo -no-prec-div -std=c++11 heap_version3.cpp           -o heap_version3
icc -O3 -ipo -no-prec-div -std=c++11 malloc_heap_version3.cpp    -o malloc_heap_version3

And with gcc:

g++-4.8 -Ofast -march=native -std=c++11 heap_version3.cpp        -o heap_version3
g++-4.8 -Ofast -march=native -std=c++11 malloc_heap_version3.cpp -o malloc_heap_version3

This is on a 2012 MacBook Pro with the corei7-avx instructions available. I have the 'as' binary swapped with a script that matches the one here so that gcc can use AVX instructions.

EDIT 1

To those asking for more loop iterations: please skim the Intel link before posting. That said, I would probably have the same reaction, so here are those loop iterations.

The array size is still 4000 and each run of the program still does a single allocation of memory. I didn't want to change what was being benchmarked by allocating a larger array that doesn't fit in L1 or repeatedly allocating and deallocating memory and raising other questions about memory. The program is run in a loop by bash. I run 4 separate programs for the benchmark, all 4 in every loop iteration in an effort to reduce heterogeneity due to other running processes.

for i in $(seq 1 10000); do
    echo gcc malloc $(./gcc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo icc malloc $(./icc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo gcc new $(./gcc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo icc new $(./icc_heap_version3 | head -n 1 | cut -d" " -f 4-)
done

icc memory allocation timings:

       malloc       new
Min.   : 3093      1150
1st Qu.: 3729      1367
Median : 3891      1496
Mean   : 4015      1571
3rd Qu.: 4099      1636
Max.   :33231    183377

    Welch Two Sample t-test
    p-value < 2.2e-16

The observed difference is unlikely to have occurred by chance.

Density estimates for compilers and allocation methods:

[Plot: probability density estimates of elapsed clock cycles during memory allocation]

The difference is now less dramatic but the order for icc is still the reverse of the expected.

EDIT 2

The results are nearly identical for an array of char. Because sizeof(float) gives me 4 and sizeof(char) gives me 1, I increased the array length to 16,000.
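
The char versions of the timed lines are not reproduced verbatim here, but they have the obvious shape:

// char variants (one per program), with size bumped to 16000:
volatile char* numbers = (char*)malloc(sizeof(char)*size);   // malloc version
// volatile char* numbers = new char[size];                  // new[] version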

EDIT 3

source code and scripts

EDIT 4

Same data replotted as a timecourse for the first 100 allocations.

[Plot: timecourse of clock cycles per memory allocation]

Praxeolitic
    Do a million allocations in a loop, divide the total "time" by a million to get average. Then run both programs a hundred times each, and get the average from that. Then you should have pretty good results. – Some programmer dude Mar 27 '14 at 11:13
  • The link by Intel for benchmarking is worth reading. – Z boson Mar 27 '14 at 13:13
  • What happens if instead of `float` you use `char`? – Dialecticus Mar 27 '14 at 16:23
  • @Dialecticus The results were exactly the same. – Praxeolitic Mar 27 '14 at 18:33
  • Could you show the whole reproducible test case (ie. full source code with build scripts)? – liori Mar 27 '14 at 19:27
  • @liori I just added them to github. https://github.com/khouli/malloc_vs_new – Praxeolitic Mar 27 '14 at 21:27
  • Looked at the Intel link - why are you still using `CPUID` if you switched to the pseudo-serializing `RDTSCP`? The added overhead should be huge compared to the malloc/new calls, and the extra clobbering may force your compiler to make unfriendly adjustments – Leeor Mar 27 '14 at 22:03
  • @Leeor You mean the second CPUID, right? That one prevents subsequent instructions from being executed before RDTSCP. The barrier provided by RDTSCP itself prevents the timed block from being executed after RDTSCP. pg 15 of the Intel document. – Praxeolitic Mar 27 '14 at 23:00
  • I know the barrier is unidirectional, but the CPUID seems like it causes more damage than a stray instruction. – Leeor Mar 28 '14 at 05:43
  • I just tried the benchmark with both CPUIDs removed. Same results, but the benchmark itself certainly went faster. – Praxeolitic Mar 28 '14 at 05:58
  • So, judging from the code—I'd say that it's quite possible that both `new` and `malloc` have some bookkeeping costs triggered by first allocation; possibly different for various implementations of the standard library and available optimizations. Could you try benchmarking, let say, 1000th allocation inside a single process instead and again draw the histograms? – liori Mar 28 '14 at 15:25
  • @liori Here's a timecourse. – Praxeolitic Mar 28 '14 at 21:17
  • Praxeolitic, please post the disassembly of the `main` function. – osgx Mar 28 '14 at 21:38
  • @Praxeolitic: then indeed you have something interesting here, worth investigating. Do both compilers use the same standard library? – liori Mar 28 '14 at 22:42

1 Answer


It doesn't work like that. Processors and operating systems are complicated. You can't just make a single call taking a few microseconds and expect to get meaningful timing information. For starters, another app could use your CPU for a bit and RDTSC will continue counting.
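
To illustrate (a rough sketch, not a drop-in replacement for your harness): repeat the timed region many times within one process and keep, say, the minimum, so that a stray interrupt or preemption can only make individual samples slower, never the best one.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc / __rdtscp (provided by gcc, icc and clang)

int main() {
    const std::size_t size = 4000;
    const int iterations = 1000000;
    uint64_t best = UINT64_MAX;

    for (int i = 0; i < iterations; ++i) {
        unsigned aux;
        // Note: no CPUID serialization here, unlike the Intel protocol --
        // good enough for a rough aggregate over many iterations.
        uint64_t start = __rdtsc();
        volatile float* numbers = new float[size];
        uint64_t stop = __rdtscp(&aux);
        delete[] const_cast<float*>(numbers);   // cast only strips the volatile qualifier
        uint64_t cycles = stop - start;
        if (cycles < best) best = cycles;       // a preempted sample can only be slower
    }
    std::printf("best observed allocation: %llu cycles\n",
                static_cast<unsigned long long>(best));
    return 0;
}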

gnasher729
  • The function calls can indeed be interrupted but if that's the explanation then what does one function do that causes it to be interrupted more often or for longer? This behavior is consistent for multiple tests, even when I reorder the tests. How do you suggest that I change the benchmark? Would a histogram of run times be more convincing? Intel's whitepaper does show the suggested apparatus being used for timings on the order of 100s of clock cycles. – Praxeolitic Mar 27 '14 at 13:44
  • You don't get preempted *that* often that a simple malloc/new call can't be completed with high probability in silence. We'd be in real trouble if that was the case. – Leeor Mar 27 '14 at 22:01
  • Well you have many other overheads. The ones I can see here are function calls and library loading the first time you use a function in your process. The very fact that your graphs show a wide spread and that the distribution does not look normal suggests that you should try executing several times, as suggested by gnasher729 and J. Pileborg. – Antoine Trouve Mar 28 '14 at 00:58
  • Please read the Intel link and the benchmarking code. There isn't any "library loading" or additional function call overhead in the timings. The distributions do look plenty normal but there's no reason to demand normality in the first place. You *should* expect a spread between distributions -- they're timing different things. I don't see why you find 10000 trials insufficient. The distinct distributions are visually obvious. – Praxeolitic Mar 28 '14 at 04:49