
The sparse matrix-vector product is a memory-bound operation due to its very low arithmetic intensity. Since a float storage format requires 4+4=8 bytes per non-zero (value and column index) compared to 8+4=12 bytes for doubles, one should expect roughly 33% faster execution when switching to floats. I constructed a benchmark which assembles a 1000000x1000000 matrix with 200 non-zeros per row, and then takes the minimum time over 20 multiplications. Source code on github here.
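
For reference, here is a minimal sketch of the kind of CSR kernel such a benchmark times; the names, the row_ptr/col_idx/val layout, and the OpenMP pragma are my assumptions rather than the exact code on github:

    #include <cstddef>
    #include <vector>

    // Minimal CSR sparse matrix-vector product y = A*x.
    // MatT is the storage type of the matrix values, VecT that of the vectors.
    template <typename MatT, typename VecT>
    void spmv(const std::vector<std::size_t>& row_ptr, // size n_rows+1
              const std::vector<int>& col_idx,         // size nnz (4-byte index)
              const std::vector<MatT>& val,            // size nnz
              const std::vector<VecT>& x,
              std::vector<VecT>& y)
    {
        const std::ptrdiff_t n_rows = (std::ptrdiff_t)y.size();
    #pragma omp parallel for
        for (std::ptrdiff_t r = 0; r < n_rows; ++r) {
            VecT sum = 0;
            for (std::size_t i = row_ptr[r]; i < row_ptr[r + 1]; ++i)
                sum += val[i] * x[col_idx[i]];
            y[r] = sum;
        }
    }

Instantiating it as spmv<double,double>, spmv<float,float> and spmv<float,double> gives the three variants timed below.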

The results are roughly what I expected. When I run the benchmark on my Intel Core i7-2620M, I see about 30% faster execution. The small shortfall from 33% shows up as a drop in effective bandwidth, from about 19.0 GB/s (doubles) to about 18.0 GB/s (floats), out of the 21.3 GB/s in the spec.
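
(That checks out arithmetically: per non-zero, 8 bytes at 18.0 GB/s versus 12 bytes at 19.0 GB/s gives a time ratio of (8/18.0)/(12/19.0) ≈ 0.70, i.e. the ~30% speedup.)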

Now, since the data for the matrix is almost 1000 times larger than that for the vectors, one would expect the faster performance to also be obtained in the case where only the matrix is stored in single precision while the vectors remain doubles. I went on to try this, and made sure to use the lower precision for the computations. When I run it, however, effective bandwidth usage suddenly drops to about 14.4 GB/s, giving only 12% faster execution than the full double version. How can one understand this?


I'm using Ubuntu 14.04 with GCC 4.9.3.

Run times:

// double(mat)-double(vec)
Wall time: 0.127577 s
Bandwidth: 18.968 GB/s
Compute:   3.12736 Gflop/s

// float(mat)-float(vec)
Wall time: 0.089386 s
Bandwidth: 18.0333 GB/s
Compute:   4.46356 Gflop/s

// float(mat)-double(vec)
Wall time: 0.112134 s
Bandwidth: 14.4463 GB/s
Compute:   3.55807 Gflop/s

Update

See the answer by Peter Cordes below. In short, the extra work the inner loop needs for the on-the-fly conversion between float and double (including breaking its false dependency) limits how fast each iteration can issue, and that is responsible for the overhead. By unrolling the loop (see the unroll-loop branch on github), full bandwidth usage is regained for both the float-double and the float-float versions!
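
A manual unroll with several independent partial sums looks roughly like the sketch below; this is my own illustration of the idea rather than the exact contents of the unroll-loop branch, and it shows a factor of 4 whereas a block size of 8 turned out to perform best (see the comments under the answer):

    #include <cstddef>

    // Inner (per-row) loop unrolled by 4 with independent accumulators, so the
    // loop overhead per element shrinks and the adds from different elements
    // can overlap instead of forming one serial chain.
    template <typename MatT, typename VecT>
    VecT row_dot_unrolled(const MatT* val, const int* col, std::size_t n,
                          const VecT* x)
    {
        VecT s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += val[i + 0] * x[col[i + 0]];
            s1 += val[i + 1] * x[col[i + 1]];
            s2 += val[i + 2] * x[col[i + 2]];
            s3 += val[i + 3] * x[col[i + 3]];
        }
        VecT sum = (s0 + s1) + (s2 + s3);
        for (; i < n; ++i)   // remainder
            sum += val[i] * x[col[i]];
        return sum;
    }

Note that gcc's own -funroll-loops output shown in the answer keeps a single accumulator; splitting the sum as above additionally hides the vaddsd latency, at the cost of changing the floating-point summation order.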

New run times:

// float(mat)-float(vec)
Wall time: 0.084455 s
Bandwidth: 19.0861 GB/s
Compute:   4.72417 Gflop/s

// float(mat)-double(vec)
Wall time: 0.0865598 s
Bandwidth: 18.7145 GB/s
Compute:   4.6093 Gflop/s
kalj
  • If the two types don't match, isn't there going to be an extra conversion happening? [As in this example](https://godbolt.org/g/H0jFzq) – Dan Mašek Apr 11 '16 at 22:01
  • Not saying that would have to be the main reason, but compared to the double-double and float-float variants, which would produce roughly equivalent code, this will be different and longer. Question is what assembly gets generated for the various combinations of data types. – Dan Mašek Apr 11 '16 at 22:14
  • Look at [this answer](http://stackoverflow.com/a/16597686/3962537). [`mult`](https://godbolt.org/g/Ml2njj). Not sure if that could be happening there. One possible tool to try is the [Intel Architecture Code Analyzer](https://software.intel.com/en-us/articles/intel-architecture-code-analyzer-download). – Dan Mašek Apr 11 '16 at 22:43
  • @DanMašek: That would have been smarter code for gcc to emit. It uses `xorpd` to break the false dep, then uses a `cvtss2sd` that can't micro-fuse, so it bottlenecks on fused-domain uop throughput. – Peter Cordes Apr 12 '16 at 00:15
  • @PeterCordes Thanks. Admittedly I don't know nearly enough about this to have understood it to this level without a few days of work and research. – Dan Mašek Apr 12 '16 at 00:22
  • @DanMašek: I just did what IACA would have told you: frontend bottleneck when the conversion is required :P Actually both loops are bottlenecked on the frontend, but the double-float one is worse. If the OP had more cores, he'd still be able to saturate his memory bandwidth, but the bottleneck in this case isn't something that hyperthreading can help with. – Peter Cordes Apr 12 '16 at 00:23

1 Answer


The double-float loop that has to convert on the fly can't issue quite as fast. With some loop unrolling, gcc would probably do a better job.

Your i7-2620M is a dual-core with hyperthreading. Hyperthreading doesn't help when the bottleneck is CPU uop throughput, rather than branch mispredicts, cache misses, or even just long latency chains. Saturating your memory bandwidth with just scalar operations isn't easy.


From the asm output for your code on the Godbolt Compiler Explorer (gcc 5.3 makes about the same inner loop, BTW, so you're not losing much in this case by using an old gcc version):

double-double version inner loop (gcc 4.9.3 -O3 -march=sandybridge -fopenmp):

## inner loop of <double,double>mult() with fused-domain uop counts
.L7:
    mov     edx, eax  # 1 uop
    add     eax, 1    # 1 uop
    mov     ecx, DWORD PTR [r9+rdx*4] # 1 uop
    vmovsd  xmm0, QWORD PTR [r10+rdx*8] # 1 uop
    vmulsd  xmm0, xmm0, QWORD PTR [r8+rcx*8]    # 2 uops
    vaddsd  xmm1, xmm1, xmm0    # 1 uop
    cmp     eax, esi  #  (macro-fused)
    jne     .L7       #  1 uop

Total: 8 fused-domain uops, so it can issue at one iteration per two clocks. It can also execute that fast: three of the uops are loads, and SnB can do 4 loads per 2 clocks. That leaves 5 ALU uops (SnB can't eliminate reg-reg moves in the rename stage; mov elimination was introduced with IvB). Anyway, there are no obvious bottlenecks on a single execution port: SnB's three ALU ports can handle up to six ALU uops per two cycles.

There's no micro-fusion because of using two-register addressing modes.


double-float version inner loop:

## inner loop of <double,float>mult() with fused-domain uop counts
.L7:
    mov     edx, eax  # 1 uop
    vxorpd  xmm0, xmm0, xmm0    # 1 uop (no execution unit needed).
    add     eax, 1    # 1 uop
    vcvtss2sd       xmm0, xmm0, DWORD PTR [r9+rdx*4]      # 2 uops
    mov     edx, DWORD PTR [r8+rdx*4] # 1 uop
    vmulsd  xmm0, xmm0, QWORD PTR [rsi+rdx*8]   # 2 uops
    vaddsd  xmm1, xmm1, xmm0    # 1 uop
    cmp     eax, ecx  # (macro-fused)
    jne     .L7       # 1 uop

gcc uses the xorpd to break the loop-carried dependency chain. cvtss2sd has a false dependency on the old value of xmm0, because it's badly designed and doesn't zero the top half of the register. (movsd when used as a load does zero, but not when used as a reg-reg move. In that case, use movaps unless you want merging.)

So, 10 fused-domain uops: can issue at one iteration per three clocks. I assume this is the only bottleneck, since it's just one extra ALU uop that needs an execution port. (SnB handles zeroing-idioms in the rename stage, so xorpd doesn't need one). cvtss2sd is a 2 uop instruction that apparently can't micro-fuse even if gcc used a one-register addressing mode. It has a throughput of one per clock. (On Haswell, it's a 2 uop instruction when used with a register src and dest, and on Skylake the throughput is reduced to one per 2 clocks, according to Agner Fog's testing.) That still wouldn't be a bottleneck for this loop on Skylake, though. It's still 10 fused-domain uops on Haswell / Skylake, and that's still the bottleneck.


-funroll-loops should help with gcc 4.9.3
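
(That is, presumably just recompiling with the flags quoted above plus unrolling, i.e. -O3 -march=sandybridge -fopenmp -funroll-loops.)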

gcc does a moderately good job, with code like

    mov     edx, DWORD PTR [rsi+r14*4]        # D.56355, *_40
    lea     r14d, [rax+2]     # D.56355,
    vcvtss2sd       xmm6, xmm4, DWORD PTR [r8+r14*4]      # D.56358, D.56358, *_36
    vmulsd  xmm2, xmm1, QWORD PTR [rcx+rdx*8]   # D.56358, D.56358, *_45
    vaddsd  xmm14, xmm0, xmm13  # tmp, tmp, D.56358
    vxorpd  xmm1, xmm1, xmm1    # D.56358

    mov     edx, DWORD PTR [rsi+r14*4]        # D.56355, *_40
    lea     r14d, [rax+3]     # D.56355,
    vcvtss2sd       xmm10, xmm9, DWORD PTR [r8+r14*4]     # D.56358, D.56358, *_36
    vmulsd  xmm7, xmm6, QWORD PTR [rcx+rdx*8]   # D.56358, D.56358, *_45
    vaddsd  xmm3, xmm14, xmm2   # tmp, tmp, D.56358
    vxorpd  xmm6, xmm6, xmm6    # D.56358

Without the loop overhead, the work for each element is down to 8 fused-domain uops, and it's no longer a tiny loop that suffers from only issuing 2 uops every 3rd cycle (which the rolled loop does, because 10 isn't a multiple of 4).

It could save the lea instructions by using displacements, e.g. [r8+rax*4 + 12]. IDK why gcc chooses not to.

Not even -ffast-math gets it to vectorize at all. There's probably no point since doing a gather from the sparse matrix would outweigh the benefit of doing a load of 4 or 8 contiguous values from the non-sparse vector. (insertps from memory is a 2-uop instruction that can't micro-fuse even with one-register addressing modes.)

On Broadwell or Skylake, vgatherdps may be fast enough to give a speedup over the scalar loads. Probably a big speedup on Skylake. (It can gather 8 single-precision floats with a throughput of 8 floats per 5 clocks. Or vgatherqpd can gather 4 doubles with a throughput of 4 doubles per 4 clocks). This sets you up for a 256b vector FMA.
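
As a rough illustration of that idea, here is a sketch of one row of the float-matrix / double-vector case with AVX2+FMA intrinsics. It is my own example rather than anything from the question's benchmark, and it uses vgatherdpd (_mm256_i32gather_pd) instead of the vgatherqpd mentioned above, since the column indices here are 32-bit:

    #include <immintrin.h>
    #include <cstddef>

    // One CSR row, 4 non-zeros per step: convert 4 floats to doubles, gather
    // 4 doubles from x by column index (vgatherdpd), accumulate with 256-bit FMA.
    double row_dot_avx2(const float* val, const int* col, std::size_t n,
                        const double* x)
    {
        __m256d acc = _mm256_setzero_pd();
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m256d a   = _mm256_cvtps_pd(_mm_loadu_ps(val + i));      // 4 floats -> 4 doubles
            __m128i idx = _mm_loadu_si128((const __m128i*)(col + i));  // 4 x 32-bit indices
            __m256d xg  = _mm256_i32gather_pd(x, idx, 8);              // gather x[col[i..i+3]]
            acc = _mm256_fmadd_pd(a, xg, acc);
        }
        double tmp[4];
        _mm256_storeu_pd(tmp, acc);
        double sum = (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);
        for (; i < n; ++i)                                             // remainder
            sum += (double)val[i] * x[col[i]];
        return sum;
    }

This needs at least Haswell (compile with e.g. -march=haswell), but as noted above the gather throughput only really starts paying off on Broadwell or Skylake.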

Peter Cordes
  • Wow, thanks for a very thorough answer, which partly flew over my head. So are you saying that the added conversion unavoidably has this overhead, or that it could be improved by better compilation/vectorization? – kalj Apr 12 '16 at 15:41
  • @kalj: It's an extra one or two uops required in the inner loop, yeah. Loop unrolling should get it to run as fast as the version without conversion, though. – Peter Cordes Apr 12 '16 at 19:53
  • Yes! Unrolling made it run at the same speed as the others. I tried different block sizes, and 8 gave the best performance. Interestingly enough, the float-float version also started yielding exactly the same bandwidth as the double-double version! Thanks a bunch! – kalj Apr 13 '16 at 14:37