(Note: there is an update at the bottom of the question)
Look at this stripped-down small_vector benchmark:
#include <cstdlib>

#define NOINLINE __attribute__((noinline))

class Array {
public:
    Array() : m_data(m_buffer), m_size(0) {}

    ~Array() {
        // if (m_data != m_buffer) free(m_data);
    }

    void append(int v) {
        if (m_size >= 3) expand();
        m_data[m_size++] = v;
    }

private:
    int *m_data;
    int m_size;
    int m_buffer[3];

    NOINLINE void expand() {}
};

NOINLINE
void add(Array &array) {
    array.append(11);
    array.append(22);
    array.append(33);
}

int main() {
    for (int i = 0; i < 1000000000; i++) {
        Array array;
        add(array);
    }
}
(NOINLINE is used because it mirrors how the compiler decided to inline the original small_vector code)
If this code is compiled with clang 11, it becomes faster when I uncomment the commented-out line in ~Array (note that the free call is never actually executed). On my machine (i7 8750), the difference is 18%. On quick-bench.com, the difference is smaller: 5.3%.
I understand that this is a microbenchmark and wild things can happen, but: add is 37 instructions / 120 bytes of code, so it is not that small, and it is identical in both versions. The only difference is main, which is not that different either; it is just the loop that is compiled a little bit differently. Yet there is a large performance difference, and the version with more instructions and branches runs faster. If I run perf stat -d -d -d, I don't see anything suspicious (no significant difference in branch or cache misses), but the insn-per-cycle difference is huge: 2.40 vs. 3.12:
Slow
        3939.23 msec task-clock                #    1.000 CPUs utilized
             10      context-switches          #    0.003 K/sec
              0      cpu-migrations            #    0.000 K/sec
            107      page-faults               #    0.027 K/sec
    13345669446      cycles                    #    3.388 GHz                        (38.26%)
    32029499558      instructions              #    2.40  insn per cycle             (45.98%)
     6005787151      branches                  # 1524.610 M/sec                      (45.99%)
          71062      branch-misses             #    0.00% of all branches            (46.09%)
     6000238616      L1-dcache-loads           # 1523.202 M/sec                      (46.20%)
         180237      L1-dcache-load-misses     #    0.00% of all L1-dcache accesses  (46.30%)
          35516      LLC-loads                 #    0.009 M/sec                      (30.87%)
          13655      LLC-load-misses           #   38.45% of all LL-cache accesses   (30.87%)
  <not supported>    L1-icache-loads
         545548      L1-icache-load-misses                                           (30.87%)
     6003584439      dTLB-loads                # 1524.051 M/sec                      (30.86%)
           5290      dTLB-load-misses          #    0.00% of all dTLB cache accesses (30.76%)
           4583      iTLB-loads                #    0.001 M/sec                      (30.65%)
           4222      iTLB-load-misses          #   92.12% of all iTLB cache accesses (30.55%)
  <not supported>    L1-dcache-prefetches
  <not supported>    L1-dcache-prefetch-misses

    3.939756460 seconds time elapsed
    3.939678000 seconds user
    0.000000000 seconds sys
Fast
        3316.00 msec task-clock                #    1.000 CPUs utilized
              5      context-switches          #    0.002 K/sec
              0      cpu-migrations            #    0.000 K/sec
            110      page-faults               #    0.033 K/sec
    11235910328      cycles                    #    3.388 GHz                        (38.24%)
    35013565821      instructions              #    3.12  insn per cycle             (45.96%)
     7002622651      branches                  # 2111.770 M/sec                      (45.96%)
          59596      branch-misses             #    0.00% of all branches            (46.02%)
     7001546754      L1-dcache-loads           # 2111.446 M/sec                      (46.14%)
         143554      L1-dcache-load-misses     #    0.00% of all L1-dcache accesses  (46.26%)
          20608      LLC-loads                 #    0.006 M/sec                      (30.88%)
           3562      LLC-load-misses           #   17.28% of all LL-cache accesses   (30.88%)
  <not supported>    L1-icache-loads
         431694      L1-icache-load-misses                                           (30.88%)
     7003243717      dTLB-loads                # 2111.958 M/sec                      (30.88%)
           3296      dTLB-load-misses          #    0.00% of all dTLB cache accesses (30.82%)
           2836      iTLB-loads                #    0.855 K/sec                      (30.70%)
           3436      iTLB-load-misses          #  121.16% of all iTLB cache accesses (30.58%)
  <not supported>    L1-dcache-prefetches
  <not supported>    L1-dcache-prefetch-misses

    3.316414943 seconds time elapsed
    3.312479000 seconds user
    0.003995000 seconds sys
Note: if I manually unroll the loop in main to this:
for (int i = 0; i < 250000000; i++) {
    { Array array; add(array); }
    { Array array; add(array); }
    { Array array; add(array); }
    { Array array; add(array); }
}
the performance difference stays the same.
Do you have any clue what causes the difference? Any tips on which perf events to check?
For reference, here are the asm listings:
Slow main
401190: 55 push rbp
401191: 41 56 push r14
401193: 53 push rbx
401194: 48 83 ec 20 sub rsp,0x20
401198: bd 00 ca 9a 3b mov ebp,0x3b9aca00
40119d: 4c 8d 74 24 14 lea r14,[rsp+0x14]
4011a2: 48 8d 5c 24 08 lea rbx,[rsp+0x8]
4011a7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
4011ae: 00 00
4011b0: 4c 89 74 24 08 mov QWORD PTR [rsp+0x8],r14
4011b5: c7 44 24 10 00 00 00 mov DWORD PTR [rsp+0x10],0x0
4011bc: 00
4011bd: 48 89 df mov rdi,rbx
4011c0: e8 4b ff ff ff call 401110 <add(Array&)>
4011c5: 83 c5 ff add ebp,0xffffffff
4011c8: 75 e6 jne 4011b0 <main+0x20>
4011ca: 31 c0 xor eax,eax
4011cc: 48 83 c4 20 add rsp,0x20
4011d0: 5b pop rbx
4011d1: 41 5e pop r14
4011d3: 5d pop rbp
4011d4: c3 ret
Fast main
4011b0: 55 push rbp
4011b1: 41 56 push r14
4011b3: 53 push rbx
4011b4: 48 83 ec 20 sub rsp,0x20
4011b8: bd 00 ca 9a 3b mov ebp,0x3b9aca00
4011bd: 48 8d 5c 24 14 lea rbx,[rsp+0x14]
4011c2: 4c 8d 74 24 08 lea r14,[rsp+0x8]
4011c7: eb 0c jmp 4011d5 <main+0x25>
4011c9: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
4011d0: 83 c5 ff add ebp,0xffffffff
4011d3: 74 26 je 4011fb <main+0x4b>
4011d5: 48 89 5c 24 08 mov QWORD PTR [rsp+0x8],rbx
4011da: c7 44 24 10 00 00 00 mov DWORD PTR [rsp+0x10],0x0
4011e1: 00
4011e2: 4c 89 f7 mov rdi,r14
4011e5: e8 46 ff ff ff call 401130 <add(Array&)>
4011ea: 48 8b 7c 24 08 mov rdi,QWORD PTR [rsp+0x8]
4011ef: 48 39 df cmp rdi,rbx
4011f2: 74 dc je 4011d0 <main+0x20>
4011f4: e8 37 fe ff ff call 401030 <free@plt>
4011f9: eb d5 jmp 4011d0 <main+0x20>
4011fb: 31 c0 xor eax,eax
4011fd: 48 83 c4 20 add rsp,0x20
401201: 5b pop rbx
401202: 41 5e pop r14
401204: 5d pop rbp
401205: c3 ret
add (same for slow and fast)
401130: 53 push rbx
401131: 48 89 fb mov rbx,rdi
401134: 8b 4f 08 mov ecx,DWORD PTR [rdi+0x8]
401137: 83 f9 03 cmp ecx,0x3
40113a: 7c 0b jl 401147 <add(Array&)+0x17>
40113c: 48 89 df mov rdi,rbx
40113f: e8 cc 00 00 00 call 401210 <Array::expand()>
401144: 8b 4b 08 mov ecx,DWORD PTR [rbx+0x8]
401147: 48 8b 03 mov rax,QWORD PTR [rbx]
40114a: 8d 51 01 lea edx,[rcx+0x1]
40114d: 89 53 08 mov DWORD PTR [rbx+0x8],edx
401150: 48 63 c9 movsxd rcx,ecx
401153: c7 04 88 0b 00 00 00 mov DWORD PTR [rax+rcx*4],0xb
40115a: 8b 4b 08 mov ecx,DWORD PTR [rbx+0x8]
40115d: 83 f9 03 cmp ecx,0x3
401160: 7c 0e jl 401170 <add(Array&)+0x40>
401162: 48 89 df mov rdi,rbx
401165: e8 a6 00 00 00 call 401210 <Array::expand()>
40116a: 8b 4b 08 mov ecx,DWORD PTR [rbx+0x8]
40116d: 48 8b 03 mov rax,QWORD PTR [rbx]
401170: 8d 51 01 lea edx,[rcx+0x1]
401173: 89 53 08 mov DWORD PTR [rbx+0x8],edx
401176: 48 63 c9 movsxd rcx,ecx
401179: c7 04 88 16 00 00 00 mov DWORD PTR [rax+rcx*4],0x16
401180: 8b 4b 08 mov ecx,DWORD PTR [rbx+0x8]
401183: 83 f9 03 cmp ecx,0x3
401186: 7c 0e jl 401196 <add(Array&)+0x66>
401188: 48 89 df mov rdi,rbx
40118b: e8 80 00 00 00 call 401210 <Array::expand()>
401190: 8b 4b 08 mov ecx,DWORD PTR [rbx+0x8]
401193: 48 8b 03 mov rax,QWORD PTR [rbx]
401196: 8d 51 01 lea edx,[rcx+0x1]
401199: 89 53 08 mov DWORD PTR [rbx+0x8],edx
40119c: 48 63 c9 movsxd rcx,ecx
40119f: c7 04 88 21 00 00 00 mov DWORD PTR [rax+rcx*4],0x21
4011a6: 5b pop rbx
4011a7: c3 ret
UPDATE
I managed to make the difference even larger: change main to this (it's just an added malloc and free call):
void *d;

int main() {
    d = malloc(1);
    for (int i = 0; i < 1000000000; i++) {
        Array array;
        add(array);
    }
    free(d);
}
With this main, the slow version becomes even slower; the difference between slow and fast is now ~30%!