
I have an i5-4250U, which has AVX2 and FMA3. I am testing some dense matrix multiplication code that I wrote, using GCC 4.8.1 on Linux. Below are the three different ways I compile it.

SSE2:     gcc matrix.cpp -o matrix_gcc -O3 -msse2 -fopenmp
AVX:      gcc matrix.cpp -o matrix_gcc -O3 -mavx  -fopenmp
AVX2+FMA: gcc matrix.cpp -o matrix_gcc -O3 -march=native -fopenmp -ffast-math

The SSE2 and AVX versions are clearly different in performance. However, the AVX2+FMA version is no better than the AVX version. I don't understand this. I get over 80% of the peak flops of the CPU assuming there is no FMA, but I think I should be able to do a lot better with FMA. Matrix multiplication should benefit directly from FMA; I'm essentially doing eight dot products at once in AVX. When I check -march=native it gives:

cc -march=native -E -v - </dev/null 2>&1 | grep cc1 | grep fma 
...-march=core-avx2 -mavx -mavx2 -mfma -mno-fma4 -msse4.2 -msse4.1 ...

So I can see FMA is enabled (just to be sure I added -mfma explicitly, but it makes no difference), and -ffast-math should allow a relaxed floating-point model. So how do I get the compiler to use Fused Multiply-Add (FMA) instructions with SSE/AVX?
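Roughly, the inner kernel looks like the sketch below (a simplified illustration, not my actual code; the name kernel_avx and the layout are made up for the example). Each iteration broadcasts one element of A and multiplies it against eight floats of B, so the accumulator holds eight dot products at once.

#include <immintrin.h>

// Simplified sketch of the AVX multiply+add accumulation (illustrative only).
// Computes C[i][j..j+7] = sum over k of A[i][k] * B[k][j..j+7].
void kernel_avx(const float *A, const float *B, float *C, int n, int i, int j)
{
    __m256 acc = _mm256_setzero_ps();
    for (int k = 0; k < n; ++k) {
        __m256 a = _mm256_broadcast_ss(&A[i*n + k]); // one scalar from A
        __m256 b = _mm256_loadu_ps(&B[k*n + j]);     // eight floats from B
        // separate multiply and add -- this is what I hoped GCC would fuse
        acc = _mm256_add_ps(_mm256_mul_ps(a, b), acc);
    }
    _mm256_storeu_ps(&C[i*n + j], acc);
}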

Edit:

Based on Mysticial's comments I went ahead and used _mm256_fmadd_ps, and now the AVX2+FMA version is faster. I'm not sure why the compiler won't do this for me. I'm now getting about 80 GFLOPS (about 110% of the peak flops without FMA) for matrices larger than 1000x1000. In case anyone does not trust my peak flops calculation, here is what I did.

peak flops (no FMA)   = frequency * simd_width * ILP * cores
                      = 2.3 GHz   * 8          * 2   * 2     =  73.6 GFLOPS
peak flops (with FMA) = 2 * peak flops (no FMA)              = 147.2 GFLOPS

My CPU runs at 2.3 GHz in turbo mode when both cores are active. I get 2 for ILP because Haswell (like Ivy Bridge) can do one AVX multiplication and one AVX addition at the same time (and I have unrolled the loop several times to ensure this).

I'm only getting about 55% of the peak flops (with FMA). I'm not sure why, but at least I'm seeing something now.

One side effect is that I now get a small error when I compare to a simple matrix multiplication algorithm I know I trust. I think that's because FMA does only a single rounding instead of the two roundings a separate multiply and add would do (which changes the results relative to a strict multiply-then-add even though it's usually more accurate).
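For reference, the change was essentially the one-line swap below (same illustrative names as the sketch above):

// before: two instructions, two roundings
acc = _mm256_add_ps(_mm256_mul_ps(a, b), acc);

// after: one fused instruction, a single rounding (needs FMA hardware and -mfma)
acc = _mm256_fmadd_ps(a, b, acc);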

Edit:

Somebody needs to redo How do I achieve the theoretical maximum of 4 FLOPs per cycle? but for 16 double-precision FLOPs per cycle with Haswell.
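The basic technique would be the same as in that question: keep enough independent FMA dependency chains in flight to cover the FMA latency. A rough sketch of what such a kernel could look like (this is not Mysticial's code; the names and constants are made up):

#include <immintrin.h>

// FMA has 5-cycle latency on Haswell and there are two FMA ports, so roughly
// ten independent accumulator chains are needed to keep both ports saturated.
// Each iteration issues 10 256-bit FMAs = 10 * 4 doubles * 2 flops = 80 DP FLOPs.
double fma_peak_kernel(long iters)
{
    __m256d a = _mm256_set1_pd(1.0e-10);
    __m256d b = _mm256_set1_pd(3.0e-10);
    __m256d r0 = _mm256_set1_pd(0.1), r1 = _mm256_set1_pd(0.2);
    __m256d r2 = _mm256_set1_pd(0.3), r3 = _mm256_set1_pd(0.4);
    __m256d r4 = _mm256_set1_pd(0.5), r5 = _mm256_set1_pd(0.6);
    __m256d r6 = _mm256_set1_pd(0.7), r7 = _mm256_set1_pd(0.8);
    __m256d r8 = _mm256_set1_pd(0.9), r9 = _mm256_set1_pd(1.0);

    for (long i = 0; i < iters; ++i) {
        r0 = _mm256_fmadd_pd(a, b, r0);  r1 = _mm256_fmadd_pd(a, b, r1);
        r2 = _mm256_fmadd_pd(a, b, r2);  r3 = _mm256_fmadd_pd(a, b, r3);
        r4 = _mm256_fmadd_pd(a, b, r4);  r5 = _mm256_fmadd_pd(a, b, r5);
        r6 = _mm256_fmadd_pd(a, b, r6);  r7 = _mm256_fmadd_pd(a, b, r7);
        r8 = _mm256_fmadd_pd(a, b, r8);  r9 = _mm256_fmadd_pd(a, b, r9);
    }

    // sum the accumulators so the compiler cannot discard the work
    __m256d s = _mm256_add_pd(_mm256_add_pd(r0, r1), _mm256_add_pd(r2, r3));
    s = _mm256_add_pd(s, _mm256_add_pd(_mm256_add_pd(r4, r5), _mm256_add_pd(r6, r7)));
    s = _mm256_add_pd(s, _mm256_add_pd(r8, r9));
    double out[4];
    _mm256_storeu_pd(out, s);
    return out[0] + out[1] + out[2] + out[3];
}

Timing iters iterations of this and dividing 80 * iters by the elapsed seconds gives double-precision GFLOPS for one core.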

Edit

Actually, Mysticial has updated his project to support FMA3 (see his answer in the link above). I ran his code on Windows 8 with MSVC 2012 (because the Linux version did not compile with FMA support). Here are the results.

Testing AVX Mul + Add:
Seconds = 22.7417
FP Ops  = 768000000000
FLOPs   = 3.37705e+010
sum = 17.8122

Testing FMA3 FMA:
Seconds = 22.1389
FP Ops  = 1536000000000
FLOPs   = 6.938e+010
sum = 333.309

That's 69.38 GFLOPS for FMA3 in double precision. For single precision I need to double it, so that's 138.76 SP GFLOPS. I calculate my peak as 147.2 SP GFLOPS. That's about 94% of the peak! In other words, I should be able to improve my GEMM code quite a bit (although it's already quite a bit faster than Eigen).

Z boson
  • Check the assembly to see if the compiler is actually using FMA instructions. If it isn't and you have already tried a bunch of compiler options (including `-ffast-math`), then your only option is to use the intrinsics manually. (or just use a library that uses it) In other words, the compiler might not be smart enough to generate FMA even if you specify `-mfma` and `-ffast-math`. – Mysticial Jan 08 '14 at 17:31
  • I took your advice @Mysticial and used `_mm256_fmadd_ps` and now I'm getting over 110% of the peak flops (calculated without FMA). I'm getting over 80 GFLOPS on my little 15 watt CPU (in an Intel NUC). But I get some small error when I compare to my base calculation. I assume that's due to the single rounding of FMA. I'm going to have to change what I compare to. Not sure how to do that. On the other hand I'm not getting nearly as large a boost as I expected (I was getting 80% of the peak flops before). – Z boson Jan 08 '14 at 19:34
  • That's pretty typical for optimization. When you make something faster, you expose new bottlenecks. So you're probably not bound by computation anymore. Are you maxing out the cache bandwidth? Your load-store throughput? – Mysticial Jan 08 '14 at 19:38
  • @Zboson: yes, using FMA will perturb the results slightly (so will tiling and vectorization and most of the other tricks that go into a fast matrix multiply). Usually one tests these operations by ensuring that the relative error in some matrix norm is bounded by a threshold linear in the size of the matrix. – Stephen Canon Jan 08 '14 at 19:53
  • @Zboson For your second edit, click through the github project that I linked from the Flops question. I recently updated it to include FMA3. – Mysticial Jan 08 '14 at 19:54
  • @Mysticial, I think my load-store throughput is fine (though that's just a guess because I think I understand it). Probably it's a cache issue, since the cache is still a bit confusing to me. When I compile with MSVC or ICC I only get 50% of the peak flops (no FMA), whereas GCC gets 80% of the peak. That's strange. I'm embarrassed to admit that I'm mostly a printf debugger still. How do I check my cache bandwidth? – Z boson Jan 08 '14 at 20:05
  • @Zboson I've never actually tried to measure cache bandwidth. But you can probably do it fairly easily by doing memory copies with SIMD on different sizes of data. (I'm also mostly a print debugger as well.) – Mysticial Jan 08 '14 at 20:09
  • @Mysticial, I ran your flops code. On Linux it doesn't compile for FMA. I tried to fix that but I got a result higher than the peak for FMA; I have not looked into it yet. I rebooted into Windows and ran your flops code and posted my results in my question. I get 95% of the peak with your code! Thanks for your code. There are a lot of code gems in there I'm going to learn from. – Z boson Jan 09 '14 at 11:41
  • @Zboson Ah yes. I haven't updated the Linux build scripts yet. But glad to hear that you at least got the Windows version working to your satisfaction. :) – Mysticial Jan 09 '14 at 15:21
  • Did you show your code? I can't see it, but I assume you used intrinsics like _mm_mul_pd(a,b) instead of writing a*b, which explains why gcc didn't generate fma instructions. – Marc Glisse Feb 08 '14 at 12:04
  • @MarcGlisse, yes I'm using intrinsics. Something like `_mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0)`. If I want GCC to do FMA when a and b are AVX registers, how do I do this? You know more about this than me. Why don't you write up an answer? – Z boson Feb 08 '14 at 17:39
  • Greetings. It seems as though you've made some edits that change the question. [The etiquette for editing](http://meta.stackexchange.com/a/11476/215019) seems to be best summarised as "You edit to make things better, clearer, more effective -- never to change meaning." Perhaps in the future, if you feel the need to ask a different question, you could ask a new question rather than changing an already existing question? – autistic Dec 25 '15 at 10:23
  • @seb, I am not sure how I changed the question. I mostly added to it. But in any case this was over a year ago. I do things a bit differently since then. – Z boson Dec 25 '15 at 10:38

2 Answers


Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5: _mm256_add_ps, for instance, is now implemented as an inline function that simply uses +, so code written with intrinsics can be optimized as well.
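For example (a minimal illustration; the function names are just for the example):

#include <immintrin.h>

// Intrinsic form: GCC 4.x treats the mul and add as mostly opaque builtins
// and will not contract them into an FMA.
__m256 madd_intrinsics(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

// Operator form on the vector type (accepted by gcc and clang): the compiler
// sees an ordinary multiply-add expression and can emit vfmadd...ps when
// built with -mfma (clang may also need -ffp-contract=fast).
__m256 madd_operators(__m256 a, __m256 b, __m256 c)
{
    return a * b + c;
}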

Marc Glisse

The following compiler options are sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single FMA instruction now (e.g. vfmadd213ps):

GCC 5.3:   -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13:    -O1 -march=core-avx2

I tried /O2 /arch:AVX2 /fp:fast with MSVC but it still does not contract (surprise surprise). MSVC will contract scalar operations though.

GCC has done this since at least GCC 5.1.
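For a quick check, a minimal test function like the one below (the names mul_add and fma_test.cpp are arbitrary) can be compiled with the options above and its assembly inspected, e.g. gcc -S -O2 -mavx2 -mfma fma_test.cpp, to confirm that a vfmadd213ps shows up:

#include <immintrin.h>

// Should compile to a single vfmadd213ps with the options listed above.
__m256 mul_add(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}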


Although -O1 is sufficient for this optimization with some compilers, always use at least -O2 for overall performance, preferably -O3 -march=native -flto and also profile-guided optimization.

And if it's ok for your code, -ffast-math.

Peter Cordes
Z boson