The double-float loop that has to convert on the fly can't issue quite as fast. With some loop unrolling, gcc would probably do a better job.
Your i7-2620M is a dual-core with hyperthreading. Hyperthreading doesn't help when the bottleneck is CPU uop throughput, rather than branch mispredicts, cache misses, or even just long latency chains. Saturating your memory bandwidth with just scalar operations isn't easy.
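For reference, here's a minimal sketch of the kind of kernel this answer is analyzing (hypothetical names and signature, not the OP's exact code): a CSR-style row dot product where the stored matrix value type can differ from the accumulator type. The two inner loops below are what the <double,double> and <double,float> instantiations compile to.

// Hypothetical sketch, not the OP's exact code.
// T = accumulator / dense-vector element type, U = stored matrix value type.
// With T = double and U = float, each value is widened on the fly (cvtss2sd).
template <typename T, typename U>
T mult(const U* values, const int* idx, const T* x, int n)
{
    T sum = 0;
    for (int j = 0; j < n; ++j)
        sum += static_cast<T>(values[j]) * x[idx[j]];  // indexed load from the dense vector
    return sum;
}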
From the asm output for your code on the Godbolt Compiler Explorer: gcc 5.3 makes about the same inner loop, BTW, so you're not losing out on much in this case by using an old gcc version.
double-double version inner loop (gcc 4.9.3 -O3 -march=sandybridge -fopenmp):
## inner loop of <double,double>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
add eax, 1 # 1 uop
mov ecx, DWORD PTR [r9+rdx*4] # 1 uop
vmovsd xmm0, QWORD PTR [r10+rdx*8] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [r8+rcx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, esi # (macro-fused)
jne .L7 # 1 uop
Total: 8 fused-domain uops, so it can issue at one iteration per two clocks (the SnB front-end issues up to 4 fused-domain uops per clock). It can also execute that fast: three of the uops are loads, and SnB can do 4 loads per 2 clocks. That leaves 5 ALU uops (SnB can't eliminate reg-reg moves in the rename stage; mov elimination was introduced with IvB). Anyway, there are no obvious bottlenecks on a single execution port: SnB's three ALU ports can handle up to six ALU uops per two cycles.
There's no micro-fusion because the loads use two-register (indexed) addressing modes.
double-float version inner loop:
## inner loop of <double,float>mult() with fused-domain uop counts
.L7:
mov edx, eax # 1 uop
vxorpd xmm0, xmm0, xmm0 # 1 uop (no execution unit needed).
add eax, 1 # 1 uop
vcvtss2sd xmm0, xmm0, DWORD PTR [r9+rdx*4] # 2 uops
mov edx, DWORD PTR [r8+rdx*4] # 1 uop
vmulsd xmm0, xmm0, QWORD PTR [rsi+rdx*8] # 2 uops
vaddsd xmm1, xmm1, xmm0 # 1 uop
cmp eax, ecx # (macro-fused)
jne .L7 # 1 uop
gcc uses the xorpd to break the loop-carried dependency chain. cvtss2sd has a false dependency on the old value of xmm0, because it's badly designed and doesn't zero the top half of the register. (movsd does zero the register when used as a load, but not when used as a reg-reg move. In that case, use movaps unless you want merging.)
So, 10 fused-domain uops: it can issue at one iteration per three clocks (10 isn't a multiple of the front-end's 4-uops-per-clock issue width). I assume this is the only bottleneck, since it's just one extra ALU uop that needs an execution port. (SnB handles zeroing idioms in the rename stage, so xorpd doesn't need one.) cvtss2sd is a 2-uop instruction that apparently can't micro-fuse even if gcc used a one-register addressing mode. It has a throughput of one per clock. (On Haswell, it's a 2-uop instruction when used with a register src and dest, and on Skylake the throughput is reduced to one per 2 clocks, according to Agner Fog's testing.) That still wouldn't be a bottleneck for this loop on Skylake, though. It's still 10 fused-domain uops on Haswell / Skylake, and that's still the bottleneck.
-funroll-loops should help with gcc 4.9.3. gcc does a moderately good job, with code like:
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+2] # D.56355,
vcvtss2sd xmm6, xmm4, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm2, xmm1, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm14, xmm0, xmm13 # tmp, tmp, D.56358
vxorpd xmm1, xmm1, xmm1 # D.56358
mov edx, DWORD PTR [rsi+r14*4] # D.56355, *_40
lea r14d, [rax+3] # D.56355,
vcvtss2sd xmm10, xmm9, DWORD PTR [r8+r14*4] # D.56358, D.56358, *_36
vmulsd xmm7, xmm6, QWORD PTR [rcx+rdx*8] # D.56358, D.56358, *_45
vaddsd xmm3, xmm14, xmm2 # tmp, tmp, D.56358
vxorpd xmm6, xmm6, xmm6 # D.56358
Without the loop overhead, the work for each element is down to 8 fused-domain uops, and it's no longer a tiny loop that suffers from issuing only 2 uops every 3rd cycle (because 10 isn't a multiple of 4).
It could save the lea instructions by using displacements, e.g. [r8+rax*4 + 12]. IDK why gcc chooses not to.
Not even -ffast-math gets it to vectorize at all. There's probably no point, since doing a gather from the sparse matrix would outweigh the benefit of doing a load of 4 or 8 contiguous values from the non-sparse vector. (insertps from memory is a 2-uop instruction that can't micro-fuse even with one-register addressing modes.)
On Broadwell or Skylake, vgatherdps may be fast enough to give a speedup over the scalar version. Probably a big speedup on Skylake. (It can gather 8 single-precision floats with a throughput of 8 floats per 5 clocks. Or vgatherqpd can gather 4 double-precision floats with a throughput of 4 doubles per 4 clocks.) This sets you up for a 256b vector FMA.
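As a rough illustration (untested, hypothetical names; it assumes 32-bit column indices, so the double version actually uses vgatherdpd via _mm256_i32gather_pd rather than vgatherqpd, and n a multiple of 4), a gathered-FMA version could look like:

#include <immintrin.h>

// Hypothetical sketch: 4 doubles per iteration, gathering x[] through the
// 32-bit index array and accumulating with a 256b FMA.
double mult_gather(const double* values, const int* idx, const double* x, int n)
{
    __m256d sum = _mm256_setzero_pd();
    for (int j = 0; j < n; j += 4) {                             // assumes n % 4 == 0
        __m128i cols = _mm_loadu_si128((const __m128i*)&idx[j]); // 4 column indices
        __m256d xv   = _mm256_i32gather_pd(x, cols, 8);          // gather 4 doubles from x
        __m256d v    = _mm256_loadu_pd(&values[j]);              // 4 contiguous matrix values
        sum = _mm256_fmadd_pd(v, xv, sum);                       // sum += v * xv
    }
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(sum),
                            _mm256_extractf128_pd(sum, 1));      // horizontal sum of 4 lanes
    return _mm_cvtsd_f64(_mm_add_sd(lo, _mm_unpackhi_pd(lo, lo)));
}

This needs AVX2 + FMA (-march=haswell or later), and it reassociates the sum, which is the sort of thing you'd normally need -ffast-math to let the compiler do for you.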