
First I wrote this code:

void dgemm3(double* A, double* B, double* C, int n){
    // Scalar dgemm, unrolled by 4 with four interleaved accumulators; assumes n is a multiple of 4.
    register int i, j, k, n4 = n * 4;
    register double cij0, cij1, cij2, cij3;
    register double *a0, *a1, *a2, *a3, *b0, *b1, *b2, *b3;

    for (i=0; i<n; ++i){
        for (j=0; j<n; ++j){

            a0 = &A[i*n];      // a0..a3: four consecutive elements of row i of A
            a1 = a0 + 1;
            a2 = a1 + 1;
            a3 = a2 + 1;
            b0 = &B[j];        // b0..b3: four consecutive elements down column j of B
            b1 = b0 + n;
            b2 = b1 + n;
            b3 = b2 + n;
            cij0 = cij1 = cij2 = cij3 = 0;

            for(k = 0; k < n; k+=4, a0+=4, a1+=4, a2+=4, a3+=4, b0+=n4, b1+=n4, b2+=n4, b3+=n4){
                cij0 += *a0 * *b0;
                cij1 += *a1 * *b1;
                cij2 += *a2 * *b2;
                cij3 += *a3 * *b3;
            }

            *C++ = cij0 + cij1 + cij2 + cij3;   // C is filled in row-major order
        }
    }
}

and then I wrote this version using AVX intrinsics:

#include <immintrin.h>

void dgemm_avx (double* A, double* B, double* C, int n) {
    for (int i=0; i<n; i++) {
        for (int j=0; j<n; j+=4) {                              // 4 doubles of C per iteration; assumes n is a multiple of 4
            __m256d c0 = _mm256_setzero_pd();
            for (int k=0; k<n; k++) {
                __m256d m1 = _mm256_broadcast_sd(A+i*n+k);      // broadcast A[i][k] to all 4 lanes
                __m256d m2 = _mm256_loadu_pd(B+k*n+j);          // load B[k][j..j+3]
                __m256d m3 = _mm256_mul_pd(m1,m2);
                c0 = _mm256_add_pd(c0,m3);
            }
            _mm256_storeu_pd(C+i*n+j, c0);                      // store C[i][j..j+3]
        }
    }
}

I expected the second one to be faster, but it is not. However, if I compile both with the -O1 flag, the second one is faster. My question is: shouldn't the second one be faster even without optimization?
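For reference, the question does not show how the two functions were compiled or timed; a minimal driver along these lines (the matrix size, initialization, file names, and build flags here are assumptions for illustration, not taken from the question) is the kind of comparison being described:

// Hypothetical timing harness -- not from the question; size, init, and flags are assumptions.
// Build e.g.:  gcc -mavx bench.c kernels.c        (no optimization, the slow case)
//              gcc -O1 -mavx bench.c kernels.c    (the case where dgemm_avx wins)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void dgemm3(double* A, double* B, double* C, int n);     // scalar version above
void dgemm_avx(double* A, double* B, double* C, int n);  // AVX version above

int main(void) {
    int n = 1024;                                    // one of the sizes mentioned in the comments
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

    clock_t t0 = clock();
    dgemm3(A, B, C, n);
    clock_t t1 = clock();
    dgemm_avx(A, B, C, n);
    clock_t t2 = clock();

    printf("dgemm3:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("dgemm_avx: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}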

  • How big are the matrices? Also have you looked at [this question](https://stackoverflow.com/questions/7839925/using-avx-cpu-instructions-poor-performance-without-archavx?rq=1)? – Bob Jarvis - Слава Україні Dec 30 '22 at 13:15
  • *Now my question is shouldn't the second one be faster even without optimization?* - no, intrinsics often totally faceplant without optimization, because there are more function calls. – Peter Cordes Dec 30 '22 at 13:23
  • You should inspect the generated assembly for both. Maybe compile to assembly and add the results to the question, even... Or that could actually work as an answer, since the 2nd one is slower because the generated assembly is slower. – hyde Dec 30 '22 at 13:27
  • @BobJarvis-СлаваУкраїні Thanks for your reply. I checked that link, but I am not switching between SSE and AVX. The matrix sizes are 512*512, 1024*1024, up to 4096*4096; the second code is slower for all of these cases. – Mohammad Dec 30 '22 at 14:00
  • @PeterCordes Thanks for your reply. Is there a way I can do these optimizations manually? With the -O1 flag the second code is twice as fast. – Mohammad Dec 30 '22 at 14:02
  • @hyde Thanks for your reply. I did that, but it is something like 1000 lines of assembly, and to be honest I couldn't wrap my head around it. – Mohammad Dec 30 '22 at 14:10
  • If for some reason you actually care about performance of debug builds, you could try `register __m256d m1 = ...` and so on to avoid some stores/reloads, but you won't avoid them all for the intrinsic call/return. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394). You could probably use some GNU C native vector syntax instead of intrinsic function calls, like `c0 += m1*m2`. (A sketch of this rewrite appears after the comment thread.) – Peter Cordes Dec 30 '22 at 14:17
  • Or if you mean actually doing *useful* optimizations (that will help in properly optimized `-O2` or `-O3` builds), you could unroll with 2 to 4 accumulators inside the inner loop, as in [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) (which is about vectorizing a single dot product, but the same concept applies here). Or perhaps unroll over an outer loop so you can reuse the same load of one vector with a different vector, improving the ratio of FMAs to loads as well. (A sketch of this unrolling also appears after the comment thread.) – Peter Cordes Dec 30 '22 at 14:20
  • If you want code that runs fast, compiling with optimization enabled is necessary. At the default `-O0` (consistent debugging, compile fast) your only options are "more slow" or "somewhat less slow". – Peter Cordes Dec 30 '22 at 14:28
  • @PeterCordes Using register and c0 += m1*m2, the second code is now a little faster even without optimization. It is so weird to me that c0 += m1*m2 is faster. As you mentioned, I should also consider loop unrolling; I thought that when using AVX we cannot use loop unrolling. Thanks a lot. – Mohammad Dec 30 '22 at 14:34
  • Look at the asm; it should have fewer stores/reloads because it avoids function calls. Look at how `_mm256_add_ps` is defined in the header, for example, as a real function with `__attribute__((always_inline))`. But see [Why is this C++ wrapper class not being inlined away?](https://stackoverflow.com/q/54073295). `c0 += m1*m2` is also all one C statement, which would allow contraction into an FMA. Normally GCC does that anyway even across statements, but of course not with optimization disabled. – Peter Cordes Dec 30 '22 at 14:40
  • See [How to optimize these loops (with compiler optimization disabled)?](https://stackoverflow.com/a/32001196) for more general stuff about optimizing source code to be less bad with compiler optimization disabled. (Normally a total waste of time; what's the point of this? Just use at least `-O2` like a normal person if you want fast code.) – Peter Cordes Dec 30 '22 at 14:41
  • For debug builds, you can compile with `-Og` if `-O1` is already too disruptive. Compiling with `-O0` is hardly ever a good idea. For meaningful benchmarks, I would not use anything below `-O2` (you can then play around with individual flags if you like). – chtz Dec 30 '22 at 15:03
  • @PeterCordes Do you have any idea how I can multiply matrices as fast as NumPy? I checked, and it took NumPy only 1.9 seconds to multiply two 4096*4096 matrices. I tried AVX with the tiling method, but it is very slow. Any source that I can read about this would be great. – Mohammad Jan 02 '23 at 20:04
  • The BLAS libraries that NumPy uses are open source. So is the C++ Eigen library. If you tile carefully, and make good use of vector registers, you can get mostly L1d cache hits, and reuse enough load results that you're mostly doing 1 load per FMA, so you can get close to the 2 FMA per clock throughput bottleneck, not just the 2 load per clock bottleneck. NumPy may also be taking advantage of multiple cores via multithreading. (A rough blocked-FMA sketch appears after the comment thread.) – Peter Cordes Jan 03 '23 at 01:40
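
Below are rough sketches of the ideas from the comments; none of this code is from the question or the comments themselves, and the function names are made up for illustration. First, the `register` / GNU C native vector syntax suggestion: with GCC or Clang, `__m256d` supports `+` and `*` directly, so `c0 += m1 * m2` avoids going through the `_mm256_mul_pd`/`_mm256_add_pd` wrapper calls at `-O0` (the broadcast, load, and store are still intrinsics), and as a single expression it can be contracted into an FMA when optimization and FMA are enabled.

#include <immintrin.h>

// Sketch only (assumes GCC/Clang with -mavx, n a multiple of 4): same loop structure as
// dgemm_avx, but the multiply/add use native vector operators instead of intrinsic wrappers.
void dgemm_avx_vecsyntax(double* A, double* B, double* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 4) {
            register __m256d c0 = _mm256_setzero_pd();
            for (int k = 0; k < n; k++) {
                register __m256d m1 = _mm256_broadcast_sd(A + i*n + k);
                register __m256d m2 = _mm256_loadu_pd(B + k*n + j);
                c0 += m1 * m2;   // one statement, so the compiler may contract it into an FMA
            }
            _mm256_storeu_pd(C + i*n + j, c0);
        }
    }
}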
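
Second, a sketch of the unrolling idea: widen the j loop so each broadcast of A[i][k] is reused for four vector multiplies, with four independent accumulators whose add chains can overlap. The n % 16 == 0 assumption and the function name are illustrative, not from the thread.

#include <immintrin.h>

// Sketch: unroll over j with 4 independent accumulators (assumes n is a multiple of 16).
// Each broadcast of A[i][k] feeds 4 vector multiplies, improving the ratio of
// arithmetic to loads and hiding the add latency across independent chains.
void dgemm_avx_unroll4(double* A, double* B, double* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j += 16) {
            __m256d c0 = _mm256_setzero_pd();
            __m256d c1 = _mm256_setzero_pd();
            __m256d c2 = _mm256_setzero_pd();
            __m256d c3 = _mm256_setzero_pd();
            for (int k = 0; k < n; k++) {
                __m256d a = _mm256_broadcast_sd(A + i*n + k);   // reused 4 times
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(a, _mm256_loadu_pd(B + k*n + j)));
                c1 = _mm256_add_pd(c1, _mm256_mul_pd(a, _mm256_loadu_pd(B + k*n + j + 4)));
                c2 = _mm256_add_pd(c2, _mm256_mul_pd(a, _mm256_loadu_pd(B + k*n + j + 8)));
                c3 = _mm256_add_pd(c3, _mm256_mul_pd(a, _mm256_loadu_pd(B + k*n + j + 12)));
            }
            _mm256_storeu_pd(C + i*n + j,      c0);
            _mm256_storeu_pd(C + i*n + j + 4,  c1);
            _mm256_storeu_pd(C + i*n + j + 8,  c2);
            _mm256_storeu_pd(C + i*n + j + 12, c3);
        }
    }
}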
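
Finally, a very rough sketch of the cache-blocking direction mentioned in the last comment. This is not a tuned BLAS-style kernel: it assumes FMA support (compile with -mfma, e.g. -march=haswell or newer), n divisible by the block size, and a caller that zeroes C first, because the kernel accumulates into C. Real BLAS implementations also pack the tiles, register-block far more aggressively, and may use multiple threads.

#include <immintrin.h>
#include <stddef.h>

#define BLOCK 64   // arbitrary assumption; tune so the working set fits in L1d/L2

// Sketch of a blocked dgemm: C += A*B over BLOCK x BLOCK tiles.
// Assumes n % BLOCK == 0, BLOCK % 4 == 0, and that C is zero-initialized by the caller.
void dgemm_blocked_fma(const double* A, const double* B, double* C, int n) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                // Multiply the (ii,kk) tile of A by the (kk,jj) tile of B into the
                // (ii,jj) tile of C; the tiles stay hot in cache while this runs.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        __m256d a = _mm256_broadcast_sd(&A[(size_t)i*n + k]);
                        for (int j = jj; j < jj + BLOCK; j += 4) {
                            __m256d c = _mm256_loadu_pd(&C[(size_t)i*n + j]);
                            __m256d b = _mm256_loadu_pd(&B[(size_t)k*n + j]);
                            c = _mm256_fmadd_pd(a, b, c);        // requires FMA (-mfma)
                            _mm256_storeu_pd(&C[(size_t)i*n + j], c);
                        }
                    }
}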

0 Answers