
I'd like to better understand why two very similar pieces of code perform so dramatically differently on my computer. These tests were run on a Ryzen processor, with gcc-trunk and Julia 0.7-alpha (LLVM 6.0). gcc-8 behaves similarly, while Julia 0.6.3 (LLVM 3.9) is slightly slower than v0.7.

I wrote generated functions (think C++ templates) that produce unrolled code for matrix operations, as well as a simple transpiler that can translate uncomplicated code to Fortran.
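
As a rough illustration of what I mean by a generated function, here is a minimal sketch (not the actual TriangularMatrices implementation; the name unrolled_mul!, the Val{N} size parameter, and the assumption of plain column-major arrays are all just for this example):

# Hypothetical sketch: emit a fully unrolled N x N multiply at compile time.
@generated function unrolled_mul!(C, A, B, ::Val{N}) where {N}
    exprs = Expr[]
    for j in 1:N, i in 1:N
        # column-major linear indices: A[i,k] -> A[i + (k-1)*N], B[k,j] -> B[k + (j-1)*N]
        terms = [:(A[$(i + (k - 1) * N)] * B[$(k + (j - 1) * N)]) for k in 1:N]
        push!(exprs, :(C[$(i + (j - 1) * N)] = +($(terms...))))
    end
    quote
        @inbounds begin
            $(exprs...)
        end
        C
    end
end

# usage: unrolled_mul!(c8, a8, b8, Val(8))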

For 8x8 matrix multiplication, here is what the Fortran code looks like:

module mul8mod

implicit none

contains


subroutine mul8x8(A, B, C)
    real(8), dimension(64), intent(in) :: A, B
    real(8), dimension(64), intent(out) :: C

    C(1) = A(1) * B(1) + A(9) * B(2) + A(17) * B(3) + A(25) * B(4)
    C(1) = C(1) + A(33) * B(5) + A(41) * B(6) + A(49) * B(7) + A(57) * B(8)
    C(2) = A(2) * B(1) + A(10) * B(2) + A(18) * B(3) + A(26) * B(4)
    C(2) = C(2) + A(34) * B(5) + A(42) * B(6) + A(50) * B(7) + A(58) * B(8)
    C(3) = A(3) * B(1) + A(11) * B(2) + A(19) * B(3) + A(27) * B(4)
    C(3) = C(3) + A(35) * B(5) + A(43) * B(6) + A(51) * B(7) + A(59) * B(8)
    C(4) = A(4) * B(1) + A(12) * B(2) + A(20) * B(3) + A(28) * B(4)
    C(4) = C(4) + A(36) * B(5) + A(44) * B(6) + A(52) * B(7) + A(60) * B(8)
    C(5) = A(5) * B(1) + A(13) * B(2) + A(21) * B(3) + A(29) * B(4)
    C(5) = C(5) + A(37) * B(5) + A(45) * B(6) + A(53) * B(7) + A(61) * B(8)
    C(6) = A(6) * B(1) + A(14) * B(2) + A(22) * B(3) + A(30) * B(4)
    C(6) = C(6) + A(38) * B(5) + A(46) * B(6) + A(54) * B(7) + A(62) * B(8)
    C(7) = A(7) * B(1) + A(15) * B(2) + A(23) * B(3) + A(31) * B(4)
    C(7) = C(7) + A(39) * B(5) + A(47) * B(6) + A(55) * B(7) + A(63) * B(8)
    C(8) = A(8) * B(1) + A(16) * B(2) + A(24) * B(3) + A(32) * B(4)
    C(8) = C(8) + A(40) * B(5) + A(48) * B(6) + A(56) * B(7) + A(64) * B(8)
    C(9) = A(1) * B(9) + A(9) * B(10) + A(17) * B(11) + A(25) * B(12)
    C(9) = C(9) + A(33) * B(13) + A(41) * B(14) + A(49) * B(15) + A(57) * B(16)
    C(10) = A(2) * B(9) + A(10) * B(10) + A(18) * B(11) + A(26) * B(12)
    C(10) = C(10) + A(34) * B(13) + A(42) * B(14) + A(50) * B(15) + A(58) * B(16)
    C(11) = A(3) * B(9) + A(11) * B(10) + A(19) * B(11) + A(27) * B(12)
    C(11) = C(11) + A(35) * B(13) + A(43) * B(14) + A(51) * B(15) + A(59) * B(16)
    C(12) = A(4) * B(9) + A(12) * B(10) + A(20) * B(11) + A(28) * B(12)
    C(12) = C(12) + A(36) * B(13) + A(44) * B(14) + A(52) * B(15) + A(60) * B(16)
    C(13) = A(5) * B(9) + A(13) * B(10) + A(21) * B(11) + A(29) * B(12)
    C(13) = C(13) + A(37) * B(13) + A(45) * B(14) + A(53) * B(15) + A(61) * B(16)
    C(14) = A(6) * B(9) + A(14) * B(10) + A(22) * B(11) + A(30) * B(12)
    C(14) = C(14) + A(38) * B(13) + A(46) * B(14) + A(54) * B(15) + A(62) * B(16)
    C(15) = A(7) * B(9) + A(15) * B(10) + A(23) * B(11) + A(31) * B(12)
    C(15) = C(15) + A(39) * B(13) + A(47) * B(14) + A(55) * B(15) + A(63) * B(16)
    C(16) = A(8) * B(9) + A(16) * B(10) + A(24) * B(11) + A(32) * B(12)
    C(16) = C(16) + A(40) * B(13) + A(48) * B(14) + A(56) * B(15) + A(64) * B(16)
    C(17) = A(1) * B(17) + A(9) * B(18) + A(17) * B(19) + A(25) * B(20)
    C(17) = C(17) + A(33) * B(21) + A(41) * B(22) + A(49) * B(23) + A(57) * B(24)
    C(18) = A(2) * B(17) + A(10) * B(18) + A(18) * B(19) + A(26) * B(20)
    C(18) = C(18) + A(34) * B(21) + A(42) * B(22) + A(50) * B(23) + A(58) * B(24)
    C(19) = A(3) * B(17) + A(11) * B(18) + A(19) * B(19) + A(27) * B(20)
    C(19) = C(19) + A(35) * B(21) + A(43) * B(22) + A(51) * B(23) + A(59) * B(24)
    C(20) = A(4) * B(17) + A(12) * B(18) + A(20) * B(19) + A(28) * B(20)
    C(20) = C(20) + A(36) * B(21) + A(44) * B(22) + A(52) * B(23) + A(60) * B(24)
    C(21) = A(5) * B(17) + A(13) * B(18) + A(21) * B(19) + A(29) * B(20)
    C(21) = C(21) + A(37) * B(21) + A(45) * B(22) + A(53) * B(23) + A(61) * B(24)
    C(22) = A(6) * B(17) + A(14) * B(18) + A(22) * B(19) + A(30) * B(20)
    C(22) = C(22) + A(38) * B(21) + A(46) * B(22) + A(54) * B(23) + A(62) * B(24)
    C(23) = A(7) * B(17) + A(15) * B(18) + A(23) * B(19) + A(31) * B(20)
    C(23) = C(23) + A(39) * B(21) + A(47) * B(22) + A(55) * B(23) + A(63) * B(24)
    C(24) = A(8) * B(17) + A(16) * B(18) + A(24) * B(19) + A(32) * B(20)
    C(24) = C(24) + A(40) * B(21) + A(48) * B(22) + A(56) * B(23) + A(64) * B(24)
    C(25) = A(1) * B(25) + A(9) * B(26) + A(17) * B(27) + A(25) * B(28)
    C(25) = C(25) + A(33) * B(29) + A(41) * B(30) + A(49) * B(31) + A(57) * B(32)
    C(26) = A(2) * B(25) + A(10) * B(26) + A(18) * B(27) + A(26) * B(28)
    C(26) = C(26) + A(34) * B(29) + A(42) * B(30) + A(50) * B(31) + A(58) * B(32)
    C(27) = A(3) * B(25) + A(11) * B(26) + A(19) * B(27) + A(27) * B(28)
    C(27) = C(27) + A(35) * B(29) + A(43) * B(30) + A(51) * B(31) + A(59) * B(32)
    C(28) = A(4) * B(25) + A(12) * B(26) + A(20) * B(27) + A(28) * B(28)
    C(28) = C(28) + A(36) * B(29) + A(44) * B(30) + A(52) * B(31) + A(60) * B(32)
    C(29) = A(5) * B(25) + A(13) * B(26) + A(21) * B(27) + A(29) * B(28)
    C(29) = C(29) + A(37) * B(29) + A(45) * B(30) + A(53) * B(31) + A(61) * B(32)
    C(30) = A(6) * B(25) + A(14) * B(26) + A(22) * B(27) + A(30) * B(28)
    C(30) = C(30) + A(38) * B(29) + A(46) * B(30) + A(54) * B(31) + A(62) * B(32)
    C(31) = A(7) * B(25) + A(15) * B(26) + A(23) * B(27) + A(31) * B(28)
    C(31) = C(31) + A(39) * B(29) + A(47) * B(30) + A(55) * B(31) + A(63) * B(32)
    C(32) = A(8) * B(25) + A(16) * B(26) + A(24) * B(27) + A(32) * B(28)
    C(32) = C(32) + A(40) * B(29) + A(48) * B(30) + A(56) * B(31) + A(64) * B(32)
    C(33) = A(1) * B(33) + A(9) * B(34) + A(17) * B(35) + A(25) * B(36)
    C(33) = C(33) + A(33) * B(37) + A(41) * B(38) + A(49) * B(39) + A(57) * B(40)
    C(34) = A(2) * B(33) + A(10) * B(34) + A(18) * B(35) + A(26) * B(36)
    C(34) = C(34) + A(34) * B(37) + A(42) * B(38) + A(50) * B(39) + A(58) * B(40)
    C(35) = A(3) * B(33) + A(11) * B(34) + A(19) * B(35) + A(27) * B(36)
    C(35) = C(35) + A(35) * B(37) + A(43) * B(38) + A(51) * B(39) + A(59) * B(40)
    C(36) = A(4) * B(33) + A(12) * B(34) + A(20) * B(35) + A(28) * B(36)
    C(36) = C(36) + A(36) * B(37) + A(44) * B(38) + A(52) * B(39) + A(60) * B(40)
    C(37) = A(5) * B(33) + A(13) * B(34) + A(21) * B(35) + A(29) * B(36)
    C(37) = C(37) + A(37) * B(37) + A(45) * B(38) + A(53) * B(39) + A(61) * B(40)
    C(38) = A(6) * B(33) + A(14) * B(34) + A(22) * B(35) + A(30) * B(36)
    C(38) = C(38) + A(38) * B(37) + A(46) * B(38) + A(54) * B(39) + A(62) * B(40)
    C(39) = A(7) * B(33) + A(15) * B(34) + A(23) * B(35) + A(31) * B(36)
    C(39) = C(39) + A(39) * B(37) + A(47) * B(38) + A(55) * B(39) + A(63) * B(40)
    C(40) = A(8) * B(33) + A(16) * B(34) + A(24) * B(35) + A(32) * B(36)
    C(40) = C(40) + A(40) * B(37) + A(48) * B(38) + A(56) * B(39) + A(64) * B(40)
    C(41) = A(1) * B(41) + A(9) * B(42) + A(17) * B(43) + A(25) * B(44)
    C(41) = C(41) + A(33) * B(45) + A(41) * B(46) + A(49) * B(47) + A(57) * B(48)
    C(42) = A(2) * B(41) + A(10) * B(42) + A(18) * B(43) + A(26) * B(44)
    C(42) = C(42) + A(34) * B(45) + A(42) * B(46) + A(50) * B(47) + A(58) * B(48)
    C(43) = A(3) * B(41) + A(11) * B(42) + A(19) * B(43) + A(27) * B(44)
    C(43) = C(43) + A(35) * B(45) + A(43) * B(46) + A(51) * B(47) + A(59) * B(48)
    C(44) = A(4) * B(41) + A(12) * B(42) + A(20) * B(43) + A(28) * B(44)
    C(44) = C(44) + A(36) * B(45) + A(44) * B(46) + A(52) * B(47) + A(60) * B(48)
    C(45) = A(5) * B(41) + A(13) * B(42) + A(21) * B(43) + A(29) * B(44)
    C(45) = C(45) + A(37) * B(45) + A(45) * B(46) + A(53) * B(47) + A(61) * B(48)
    C(46) = A(6) * B(41) + A(14) * B(42) + A(22) * B(43) + A(30) * B(44)
    C(46) = C(46) + A(38) * B(45) + A(46) * B(46) + A(54) * B(47) + A(62) * B(48)
    C(47) = A(7) * B(41) + A(15) * B(42) + A(23) * B(43) + A(31) * B(44)
    C(47) = C(47) + A(39) * B(45) + A(47) * B(46) + A(55) * B(47) + A(63) * B(48)
    C(48) = A(8) * B(41) + A(16) * B(42) + A(24) * B(43) + A(32) * B(44)
    C(48) = C(48) + A(40) * B(45) + A(48) * B(46) + A(56) * B(47) + A(64) * B(48)
    C(49) = A(1) * B(49) + A(9) * B(50) + A(17) * B(51) + A(25) * B(52)
    C(49) = C(49) + A(33) * B(53) + A(41) * B(54) + A(49) * B(55) + A(57) * B(56)
    C(50) = A(2) * B(49) + A(10) * B(50) + A(18) * B(51) + A(26) * B(52)
    C(50) = C(50) + A(34) * B(53) + A(42) * B(54) + A(50) * B(55) + A(58) * B(56)
    C(51) = A(3) * B(49) + A(11) * B(50) + A(19) * B(51) + A(27) * B(52)
    C(51) = C(51) + A(35) * B(53) + A(43) * B(54) + A(51) * B(55) + A(59) * B(56)
    C(52) = A(4) * B(49) + A(12) * B(50) + A(20) * B(51) + A(28) * B(52)
    C(52) = C(52) + A(36) * B(53) + A(44) * B(54) + A(52) * B(55) + A(60) * B(56)
    C(53) = A(5) * B(49) + A(13) * B(50) + A(21) * B(51) + A(29) * B(52)
    C(53) = C(53) + A(37) * B(53) + A(45) * B(54) + A(53) * B(55) + A(61) * B(56)
    C(54) = A(6) * B(49) + A(14) * B(50) + A(22) * B(51) + A(30) * B(52)
    C(54) = C(54) + A(38) * B(53) + A(46) * B(54) + A(54) * B(55) + A(62) * B(56)
    C(55) = A(7) * B(49) + A(15) * B(50) + A(23) * B(51) + A(31) * B(52)
    C(55) = C(55) + A(39) * B(53) + A(47) * B(54) + A(55) * B(55) + A(63) * B(56)
    C(56) = A(8) * B(49) + A(16) * B(50) + A(24) * B(51) + A(32) * B(52)
    C(56) = C(56) + A(40) * B(53) + A(48) * B(54) + A(56) * B(55) + A(64) * B(56)
    C(57) = A(1) * B(57) + A(9) * B(58) + A(17) * B(59) + A(25) * B(60)
    C(57) = C(57) + A(33) * B(61) + A(41) * B(62) + A(49) * B(63) + A(57) * B(64)
    C(58) = A(2) * B(57) + A(10) * B(58) + A(18) * B(59) + A(26) * B(60)
    C(58) = C(58) + A(34) * B(61) + A(42) * B(62) + A(50) * B(63) + A(58) * B(64)
    C(59) = A(3) * B(57) + A(11) * B(58) + A(19) * B(59) + A(27) * B(60)
    C(59) = C(59) + A(35) * B(61) + A(43) * B(62) + A(51) * B(63) + A(59) * B(64)
    C(60) = A(4) * B(57) + A(12) * B(58) + A(20) * B(59) + A(28) * B(60)
    C(60) = C(60) + A(36) * B(61) + A(44) * B(62) + A(52) * B(63) + A(60) * B(64)
    C(61) = A(5) * B(57) + A(13) * B(58) + A(21) * B(59) + A(29) * B(60)
    C(61) = C(61) + A(37) * B(61) + A(45) * B(62) + A(53) * B(63) + A(61) * B(64)
    C(62) = A(6) * B(57) + A(14) * B(58) + A(22) * B(59) + A(30) * B(60)
    C(62) = C(62) + A(38) * B(61) + A(46) * B(62) + A(54) * B(63) + A(62) * B(64)
    C(63) = A(7) * B(57) + A(15) * B(58) + A(23) * B(59) + A(31) * B(60)
    C(63) = C(63) + A(39) * B(61) + A(47) * B(62) + A(55) * B(63) + A(63) * B(64)
    C(64) = A(8) * B(57) + A(16) * B(58) + A(24) * B(59) + A(32) * B(60)
    C(64) = C(64) + A(40) * B(61) + A(48) * B(62) + A(56) * B(63) + A(64) * B(64)
end subroutine mul8x8

end module mul8mod

The Julia code looks similar, but there I first extract all the elements of the inputs, work on the scalars, and then insert the results. I found that this works better in Julia, but worse in Fortran.
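
To illustrate the pattern, here is a hand-written sketch for the 2x2 case (not the generated code itself, just the shape of it):

function mul2x2!(C, A, B)
    @inbounds begin
        # extract every element into a scalar first (column-major order)...
        A1, A2, A3, A4 = A[1], A[2], A[3], A[4]
        B1, B2, B3, B4 = B[1], B[2], B[3], B[4]
        # ...do all of the arithmetic on the scalars...
        C1 = A1 * B1 + A3 * B2
        C2 = A2 * B1 + A4 * B2
        C3 = A1 * B3 + A3 * B4
        C4 = A2 * B3 + A4 * B4
        # ...and only write the results back at the end.
        C[1], C[2], C[3], C[4] = C1, C2, C3, C4
    end
    C
end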

The expressions look simple enough that there should be no trouble vectorizing them, and Julia does so beautifully. Updating an 8x8 matrix in place:

# Julia benchmark; using YMM vectors
@benchmark mul!($c8, $a8, $b8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     57.059 ns (0.00% GC)
  median time:      58.901 ns (0.00% GC)
  mean time:        59.522 ns (0.00% GC)
  maximum time:     83.196 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     984

This works well.

Compiling the Fortran code with:

gfortran-trunk -march=native -Ofast -mprefer-vector-width=256 -shared -fPIC mul8module1.F08 -o libmul8mod1v15.so

Benchmark results:

# gfortran, using XMM vectors; the unrolled 8x8 matrix multiplication above
@benchmark mul8v15!($c8, $a8, $b8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     122.175 ns (0.00% GC)
  median time:      128.373 ns (0.00% GC)
  mean time:        128.643 ns (0.00% GC)
  maximum time:     194.090 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     905

This takes about twice as long. Looking at the assembly produced with -S reveals that gfortran ignored -mprefer-vector-width=256 and used xmm registers instead. This is also more or less what I get in Julia when I pass pointers instead of arrays or mutable structs (given raw pointers, Julia assumes aliasing and compiles a slower version).

I tried a variety of variations on the generated Fortran code (e.g., sum(va * vb) statements, where va and vb are length-4 vectors), but the simplest was just calling the intrinsic function matmul. Compiling matmul (for the known 8x8 size) without -mprefer-vector-width=256:

# gfortran using XMM vectors generated from intrinsic matmul function
@benchmark mul8v2v2!($c8, $a8, $b8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.983 ns (0.00% GC)
  median time:      96.366 ns (0.00% GC)
  mean time:        97.651 ns (0.00% GC)
  maximum time:     166.845 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

and compiling WITH it:

# gfortran using YMM vectors with intrinsic matmul
@benchmark mul8v2v1!($c8, $a8, $b8)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     163.667 ns (0.00% GC)
  median time:      166.544 ns (0.00% GC)
  mean time:        168.320 ns (0.00% GC)
  maximum time:     277.291 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     780

The AVX-free matmul looks really fast considering it only uses xmm registers, but when coerced into ymm it is dreadful.

Any idea what's going on? I want to understand why, when instructed to do the same thing and generating broadly similar assembly, one version is so dramatically faster than the other.

FWIW, the input data is 8-byte aligned. I also tried 16-byte-aligned inputs, and it didn't seem to make a real difference.
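
The alignment can be checked from the Julia side; this assumes, as the ccall wrappers below do, that pointer_from_objref points at the start of the matrix data:

# remainder 0 => 32-byte (YMM) aligned; 16 => 16-byte; 8 or 24 => only 8-byte aligned
UInt(pointer_from_objref(a8)) % 32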

I took a look at the assembly produced by gfortran (note that this is just the intrinsic matmul version) with:

gfortran-trunk -march=native -Ofast -mprefer-vector-width=256 -shared -fPIC -S mul8module2.F08 -o mul8mod2v1.s

and compared it with the assembly from Julia/LLVM, obtained via @code_native mul!(c8, a8, b8) (the unrolled matrix multiplication).

I would be more than happy to share all the assembly or anything else if someone is willing to take a look, but I'd hit the character limit on this post if I included it here.

Both correctly use ymm registers, with lots of vfmadd__pd instructions, along with plenty of vmovupd, vmulpd, and vmovapd.

The biggest difference I noticed is that while LLVM uses lots of vbroadcastsd, gcc instead emits piles of vunpcklpd and vpermpd instructions.

A brief sample; gcc:

vpermpd $216, %ymm7, %ymm7
vpermpd $216, %ymm2, %ymm2
vpermpd $216, %ymm3, %ymm3
vpermpd $216, %ymm5, %ymm5
vunpckhpd   %ymm6, %ymm4, %ymm4
vunpcklpd   %ymm7, %ymm2, %ymm6
vunpckhpd   %ymm7, %ymm2, %ymm2
vunpcklpd   %ymm5, %ymm3, %ymm7
vpermpd $216, %ymm15, %ymm15
vpermpd $216, %ymm4, %ymm4
vpermpd $216, %ymm0, %ymm0
vpermpd $216, %ymm1, %ymm1
vpermpd $216, %ymm6, %ymm6
vpermpd $216, %ymm7, %ymm7
vunpckhpd   %ymm5, %ymm3, %ymm3
vunpcklpd   %ymm15, %ymm0, %ymm5
vunpckhpd   %ymm15, %ymm0, %ymm0
vunpcklpd   %ymm4, %ymm1, %ymm15
vunpckhpd   %ymm4, %ymm1, %ymm1
vunpcklpd   %ymm7, %ymm6, %ymm4
vunpckhpd   %ymm7, %ymm6, %ymm6

Julia/LLVM:

vbroadcastsd    8(%rax), %ymm3
vbroadcastsd    72(%rax), %ymm2
vbroadcastsd    136(%rax), %ymm12
vbroadcastsd    200(%rax), %ymm8
vbroadcastsd    264(%rax), %ymm10
vbroadcastsd    328(%rax), %ymm15
vbroadcastsd    392(%rax), %ymm14
vmulpd  %ymm7, %ymm0, %ymm1
vmulpd  %ymm11, %ymm0, %ymm0
vmovapd %ymm8, %ymm4

Could this explain the difference? Why does gcc optimize this so poorly, and is there any way I can help it generate code more comparable to LLVM's?

Overall, gcc tends to outperform Clang in benchmarks (e.g., on Phoronix). Maybe I should also try Flang (the LLVM-based Fortran compiler), as well as Eigen (with both g++ and clang++).

To reproduce, here is the intrinsic matmul version:

module mul8mod

implicit none

contains

subroutine intrinsic_mul8x8(A, B, C)
    real(8), dimension(8,8), intent(in) :: A, B
    real(8), dimension(8,8), intent(out) :: C

    C = matmul(A, B)

end subroutine

end module mul8mod

It was compiled as above. Julia code to reproduce the benchmarks:

#Pkg.clone("https://github.com/chriselrod/TriangularMatrices.jl")
using TriangularMatrices, BenchmarkTools, Compat
a8 = randmat(8); b8 = randmat(8); c8 = randmat(8);
import TriangularMatrices: mul!
@benchmark mul!($c8, $a8, $b8)
@code_native mul!(c8, a8, b8) 

# after compiling the intrinsic matmul module into a shared library (here libmul8mod2v1.so)
# If compiled outside the working directory, replace pwd() accordingly
const libmul8path2v1 = joinpath(pwd(), "libmul8mod2v1.so")

function mul8v2v1!(C, A, B)
    ccall((:__mul8mod_MOD_intrinsic_mul8x8, libmul8path2v1),
        Cvoid,(Ptr{Cvoid},Ptr{Cvoid},Ptr{Cvoid}),
        pointer_from_objref(A),
        pointer_from_objref(B),
        pointer_from_objref(C))
    C
end
@benchmark mul8v2v1!($c8, $a8, $b8)
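
The wrapper for the hand-unrolled version benchmarked above (mul8v15!) is analogous; a sketch, assuming the symbol follows the same gfortran module-name mangling:

const libmul8path1v15 = joinpath(pwd(), "libmul8mod1v15.so")

function mul8v15!(C, A, B)
    ccall((:__mul8mod_MOD_mul8x8, libmul8path1v15),
        Cvoid, (Ptr{Cvoid}, Ptr{Cvoid}, Ptr{Cvoid}),
        pointer_from_objref(A),
        pointer_from_objref(B),
        pointer_from_objref(C))
    C
end
@benchmark mul8v15!($c8, $a8, $b8)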

EDIT:

Thanks for the responses, everyone!

Because the code with broadcasts is dramatically faster, I rewrote my code generator to encourage broadcasting. The generated code now looks more like this:

C[1] = B[1] * A[1]
C[2] = B[1] * A[2]
C[3] = B[1] * A[3]
C[4] = B[1] * A[4]
C[5] = B[1] * A[5]
C[6] = B[1] * A[6]
C[7] = B[1] * A[7]
C[8] = B[1] * A[8]
C[1] += B[2] * A[9]
C[2] += B[2] * A[10]
C[3] += B[2] * A[11]
C[4] += B[2] * A[12]
C[5] += B[2] * A[13]
C[6] += B[2] * A[14]
C[7] += B[2] * A[15]
C[8] += B[2] * A[16]
C[1] += B[3] * A[17]
...

The intent is for the compiler to broadcast each element of B and then use repeated vectorized fma instructions. Julia really liked this rewrite:

# Julia benchmark; using YMM vectors
@benchmark mul2!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     45.156 ns (0.00% GC)
  median time:      47.058 ns (0.00% GC)
  mean time:        47.390 ns (0.00% GC)
  maximum time:     62.066 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     990

Figuring this was LLVM being smart, I also built Flang (the Fortran frontend for LLVM):

# compiled with
# flang -march=native -Ofast -mprefer-vector-width=256 -shared -fPIC mul8module6.f95 -o libmul8mod6v2.so
@benchmark mul8v6v2!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     51.322 ns (0.00% GC)
  median time:      52.791 ns (0.00% GC)
  mean time:        52.944 ns (0.00% GC)
  maximum time:     83.376 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

This is also really good. gfortran still refused to use broadcasts, and was still slow.

I still have questions about how best to generate code. Encouraging broadcasts is obviously the way to go. Right now I'm basically doing a matrix * vector multiplication and repeating it for every column of B, so the written code loops over A once per column of B (see the loop-level sketch below). I don't know whether that is what the compiler actually does, or whether some other pattern would lead to faster code.
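
For reference, here is a loop-level sketch of what I am assuming the unrolled code amounts to (plain column-major arrays; the name mul_bcast! is just for illustration):

function mul_bcast!(C, A, B)
    n = size(A, 1)                   # assumes square n x n matrices
    @inbounds for j in 1:n           # one pass over A per column of B / C
        b = B[1, j]
        @simd for i in 1:n
            C[i, j] = b * A[i, 1]
        end
        for k in 2:n
            b = B[k, j]              # this scalar is what should become a vbroadcastsd
            @simd for i in 1:n
                C[i, j] += b * A[i, k]   # vectorized fma down a column of A
            end
        end
    end
    C
end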

The point of optimizing multiplication of tiny matrices is to use it as a kernel in a recursive algorithm for multiplying larger matrices, so I also need to figure out the best way to handle different sizes. This approach is far better for 8x8 than for other sizes. For the remaining nrow(A) % 4 rows (e.g., if A has 10 rows, 10 % 4 = 2), I used the old approach after the broadcastable block.

But 10x10 matrices take 151 ns, and even though 12 is perfectly divisible by 4, 12x12 takes 226 ns. If this approach scaled as O(n^3), those times should be about 91 ns and 158 ns respectively, so I am falling well short. I think I need to block down to a very small size and try to use as many 8x8 kernels as possible.
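
The expected times above are just the ~47 ns 8x8 timing scaled cubically:

47 * (10/8)^3   # ≈ 91.8 ns expected for 10x10
47 * (12/8)^3   # ≈ 158.6 ns expected for 12x12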

It may be the case that 8x8 ought to be the maximum size.

Chris Elrod
  • This kind of basic block vectorization is really pushing the limits of current compiler technology. So differences such as this are to be expected since not all compilers are equal and not all compilers have the same set of optimization passes. – Mysticial Jun 08 '18 at 20:59
  • There is no reason to expect matching results from the same high level source code with different compilers or different settings from the same compiler family. – old_timer Jun 09 '18 at 03:11
  • Note that `vpermpd ymm` is relatively slow on AMD Ryzen. Its throughput is one instruction per 2 CPU cycles. On the other hand `vbroadcastsd` has a throughput of 2 instructions per cycle. Probably gcc's code isn't too bad on Intel CPUs, which have a faster `vpermpd`. See also [Agner Fog's Instruction tables](http://www.agner.org/optimize/instruction_tables.pdf) or [InstLatX64](http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt). – wim Jun 09 '18 at 09:28
  • Can you add a comment / note inside the code-formatting of each benchmark result to say what it's for? e.g. gcc-generated code using YMM vectors? And BTW, yes gcc's code looks terrible with that many shuffles. `vbroadcastsd` doesn't use any shuffle uops at all, it just runs on a load port (on both Intel and AMD). L1d cache is very fast, so doing broadcast loads of one matrix and regular vector loads of the other is probably better than loading once and shuffling even on Intel CPUs, just for total uop throughput. AVX-512 shuffles with 2 input regs might change that (`vpermt2q`). – Peter Cordes Jun 09 '18 at 11:46
  • gfortran seems to struggle with vectorizing these blocks. It however did relatively well when I tried blocks for Cholesky decompositions or inverting triangular matrices. I'm not surprised that results are different with different compilers and options, but I would like my high level code to run quickly. Part of that is trying to understand what happens as it gets compiled, so I can write it in a way friendlier to the compiler. There was a great payoff in recognizing the importance of broadcasts, although ironically only llvm seemed to benefit. gfortran still refuses to broadcast. – Chris Elrod Jun 09 '18 at 19:16
  • wim, interesting; thanks for the references. Glad to see that Intel processors are just as fast with the broadcast statements, so aiming for them seems like the way to go. I have a laptop with (I believe) a Haswell, so I'll try there. It also has ifort, which'd be worth testing. – Chris Elrod Jun 09 '18 at 19:33
  • Would be cool to have access to Skylake and AVX-512! I added comments on the benchmarks, plus two new ones after now better encouraging the broadcasts. For C = A * B (or C += A*B): Peter, do you think it's worth repeatedly loading A in the loop, once per column of C? I'm having trouble thinking of another order to save on moving A in and out of registers. – Chris Elrod Jun 09 '18 at 19:34
  • *The point of optimizing multiplication of tiny matrices is as a kernel for a recursive algorithm for multiplying larger matrices.* Are you *sure* you need to implement large-matrix matmul yourself? You're very unlikely to get auto-vectorization to make code that beats a hand-tuned BLAS DGEMM implementation or manually-vectorized C/C++ intrinsics (Eigen). If you're doing this to learn more about performance tuning, matmul is a very well-studied problem, so you can learn about cache-blocking and so on. In that case sure, there's lots to play with. – Peter Cordes Jun 09 '18 at 20:08
  • Or are you trying to do some other work on the fly during matmul while data is in registers, instead of spending CPU front-end throughput on loads/stores? i.e. to increase your arithmetic intensity by avoiding a separate pass over your data for something else simple? – Peter Cordes Jun 09 '18 at 20:09
  • Use @PeterCordes or whatever when you reply to comments, so people get notified. *Peter, do you think it's worth repeatedly loading A in the loop, once per column of C?* That might be a good choice for the compiler to make, instead of doing a separate `vmovdqa` load, especially if the source data is aligned. A memory operand for `vfmaddpd` can micro-fuse ([if the compiler avoids an indexed addressing mode](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes)). OoO exec should hide FMA latency well, but it won't hurt to have multiple vector accumulators going... – Peter Cordes Jun 09 '18 at 20:26
  • ... and you'd start to run out of registers with all the B broadcasts done + accumulators for parts of C. But in Fortran / Julia / C++ all you can do is try to hand-hold the compiler towards that decision. If its cost model doesn't reflect the microarch in that level of detail, or its code-gen heuristics just don't work that way, a compiler is going to want to do a separate load for data from A and keep it in a register. Maybe without `-ffast-math` the order of operations can force something, but IDK if auto-vectorization will give you good code on multiple compilers! – Peter Cordes Jun 09 '18 at 20:31
  • @PeterCordes Ha ha, no, I'm not sure I have to. I will only be using this through Julia, which is linked to OpenBLAS by default (and optionally MKL instead), which has piles of optimized assembly for its kernels. I would be very happy to be within 2x of OpenBLAS for matrices that still fit into L3 cache. There is a lot of overhead in BLAS calls though -- for example, the 8x8 matrix multiplication takes over 300 ns! More than 6x slower than my best time in Julia (although the relative difference is smaller for non-8x8 matrices). – Chris Elrod Jun 09 '18 at 22:08
  • I have three reasons. Number 1 is of course fun/educational. Number 2 is that I'm a statistics graduate student, and my research at the moment is focusing on speeding up simulations where you may have to solve models thousands of times. Currently, I've only been focusing on small parameter models, where avoiding BLAS and LAPACK are best. I need to do a lot of Cholesky factorizations, inversions, and multiplication. I figured it would be nice to push the tipping point back as far as possible. Reason number 3 is supporting arbitrary user types, such as dual numbers for automatic differentiation. – Chris Elrod Jun 09 '18 at 22:12
  • I added the `-ffast-math` to both because I wanted `-fassociative-math` to encourage the compiler to take liberties to get good vectorization. LLVM obviously did a lot in the first cases, where it found a much better way to vectorize (via the broadcasts) than I had in mind. gfortran still didn't, even when I explicitly created 4-length vectors [B(1),B(1),B(1),B(1)] that I used for the operations. Just tried recompiling that one, now with -O3 instead of -Ofast, and still no vbroadcastsd in the assembly. – Chris Elrod Jun 09 '18 at 22:24
  • Yeah for small matrices a generic BLAS function will be slower if you get the compiler to make good code for a fixed-size 8x8 matrix. I'd suggest reading Agner Fog's microarch guide to learn how the pipeline works, so you'll have a general understanding of why one compiler output is (or might be) faster than another. Don't spend too much time going down the rabbit hole of optimization unless / until your computations aren't near-instant. Too bad you can't get gcc's auto-vectorizer to do a good job, but that's not uncommon, unfortunately. – Peter Cordes Jun 09 '18 at 22:25
  • Good point on interweaving operations and the multiplications. That is definitely a potential advantage of tightly integrating. I'll keep that possibility in mind while looking for places to optimize. Perhaps the most straightforward example would be that, rather than explicitly creating and allocating a matrix, I could generate it during an operation. Another use case is simply making it easier to take advantage of any special structures in a matrix (although plenty of well optimized software exists already for that too). – Chris Elrod Jun 09 '18 at 22:28
  • My adviser agrees with you that I should stop wasting time with this rabbit hole, and realistically I'm well aware that I'm spending more time than I'm saving, but -- it's fun! =P Long term though, I think there are a lot of benefits. I'll check out the "Optimizing software in C++" guide. Looks like a great resource. – Chris Elrod Jun 09 '18 at 22:38
  • @PeterCordes I have to move on to something else for now, but I wrote up a brief summary here: https://discourse.julialang.org/t/we-can-write-an-optimized-blas-library-in-pure-julia/11634 tldr: edged out OpenBLAS on Ryzen up to around 128x128 matrices. Still fairly competitive at 256. Slower when matrix sizes weren't divisible by 8, but that's something I'd address once I get back to this. – Chris Elrod Jun 13 '18 at 17:30
  • @ChrisElrod: If you find performance tuning / CPU architecture stuff fascinating and more interesting than the result of actual problem you're optimizing, it's a career option. There is some small demand for performance tuning experts. – Peter Cordes Jun 14 '18 at 00:14
  • @ChrisElrod FYI, if this is still something that matters to you, the current git version of OpenBLAS now contains a number of ASM improvements, including replacing some `vpermpd` isns for better Zen and especially Zen2 performance. – uLoop Jul 28 '19 at 16:42
  • Regarding Fortran: `gfortran` version 10 will only use YMM registers and associated operations if compiled with `-O3`. – mobiuseng Apr 23 '21 at 20:06

1 Answer


This would be a good case for profiling and performance analysis with a low-level tool that can expose microarchitectural bottlenecks. While I have not used AMD μProf, my experience with Intel's equivalents like XTU suggests you'll get the best results from a tool written by someone working for the same company, maybe even sitting near the people responsible for the hardware implementation of Ryzen's AVX instructions.

Start with an event-based profile of your application running through a large number of iterations. General areas to look at include whether one style of generated assembly makes better use of execution ports or other backend CPU resources than the other, and whether the two behave differently with respect to cache and memory accesses. None of that would answer the conceptual question of why gcc chose to generate assembly in one style and LLVM in another, but it might tell you, at the hardware level, why the LLVM-generated assembly runs faster.

Aaron Altman