
I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which I suspect is getting in the way of automatic vectorization by the compiler. How can I vectorize this code?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>

void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}

int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}

To compile it, use the following if you have MKL; otherwise plain g++ -O3 works.

g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5

So far, I have tried adding -march=skylake-avx512 as a compiler option, but the loop is still not vectorized, as I found by compiling with -fopt-info-vec-all:

ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.

and this is the time it takes currently at my end:

time ./a.out
real    0m0.034s
user    0m0.026s
sys     0m0.009s
Tania
  • In what case is `Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1` different from `Z_curr[i]`? – chtz May 30 '21 at 15:06
  • Actually this should be ```Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;``` I am correcting it. – Tania May 30 '21 at 15:23
  • It does do some vectorization if I do `Amem[i] = Z_curr[i] * (std::copysign(0.45, Z_curr[i]) + 0.55);`, which should be equivalent unless you depend on signed zero or NaN behavior. – Ruslan May 30 '21 at 15:51
  • @Ruslan Much simpler would be `max(Z_curr[i], 0.1*Z_curr[i])` (should even work for infinity or NaN) – chtz May 30 '21 at 17:04

1 Answer


There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler by using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if, e.g., Amem == Z_curr+1 -- but Amem == Z_curr should, in this case, be fine).

void relu(double* __restrict Amem, double* Z_curr, int bo)

Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).

If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
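In case it helps, here is a sketch of what that simplified version might look like as a full function (the name relu_max is just for illustration and is not part of the Godbolt link above):

#include <algorithm>

// Branch-free variant: std::max(z, 0.1*z) equals z for z > 0 and 0.1*z otherwise,
// so the compiler sees no control flow inside the loop and can vectorize it.
void relu_max(double* __restrict Amem, double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        Amem[i] = std::max(Z_curr[i], 0.1 * Z_curr[i]);
    }
}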

Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).

For completeness, here is a possible manually vectorized implementation using intrinsics:

#include <immintrin.h>

void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        __mmask8 positive = _mm512_cmplt_pd_mask(_mm512_setzero_pd(), z); // set where 0 < z
        // multiply by 0.1 only where z is not positive; keep z unchanged where it is
        __m512d res = _mm512_mask_mul_pd(z, ~positive, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}

Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could do the remaining elements with a masked or overlapping store (or maybe you have use-cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup loop out entirely).
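As a sketch of that masked-store idea (the name relu_avx512_masked is only for illustration, and this variant is not part of the Godbolt link), the tail could be handled like this:

#include <immintrin.h>

void relu_avx512_masked(double* __restrict Amem, double* Z_curr, int bo)
{
    const __m512d zero  = _mm512_setzero_pd();
    const __m512d tenth = _mm512_set1_pd(0.1);
    int i;
    for (i = 0; i <= bo - 8; i += 8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr + i);
        __mmask8 positive = _mm512_cmplt_pd_mask(zero, z);
        __m512d res = _mm512_mask_mul_pd(z, ~positive, z, tenth);
        _mm512_storeu_pd(Amem + i, res);
    }
    // handle the remaining 0..7 elements with a masked load/store instead of a scalar loop
    int rest = bo - i;
    if (rest > 0) {
        __mmask8 tail = (__mmask8)((1u << rest) - 1);   // one bit per remaining element
        __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
        __mmask8 positive = _mm512_cmplt_pd_mask(zero, z);
        __m512d res = _mm512_mask_mul_pd(z, ~positive, z, tenth);
        _mm512_mask_storeu_pd(Amem + i, tail, res);
    }
}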

chtz
  • Got it. There is a backward pass that has an expression like ```dZcurr[i] = (Z_curr[i] > 0) ? dAcurr[i] : 0.1*dAcurr[i];``` Here I think we do have to deal with the ternary, though? Also, thanks for the godbolt site that gives a far better picture of the results of the compilation. – Tania May 30 '21 at 17:54
  • For the backwards expression, you may need to use intrinsics, or switch to clang if that is possible for you. – chtz May 30 '21 at 17:56
  • You can force the use of 512-bit registers with `-mprefer-vector-width=512`. This isn't the default because it sometimes actually hurts performance, see https://stackoverflow.com/questions/52523349/avx-512-vs-avx2-performance-for-simple-array-processing-loops/52523647#52523647 – Nate Eldredge May 30 '21 at 18:43
  • @chtz: thank you for the code snippet with intrinsics. I also wanted to know your opinion about parallelizing the ```for (i=0; i<=bo-8; i+=8)``` loop among multiple threads. I tried adding #pragma omp for, but it increases the runtime many-fold. – Tania May 30 '21 at 22:28
  • @NateEldredge: thank you for the suggestion, it helps the performance in my case. – Tania May 30 '21 at 22:29
  • @Tania regarding multithreading, better ask a new question (if you can't find an existing one). But your problem size is probably just too small. – chtz May 31 '21 at 10:47
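For reference, the backward-pass expression from the comments could be vectorized along the same lines; a minimal sketch, assuming the output does not alias the inputs (the function name relu_backward_avx512 is hypothetical, not from the answer above):

#include <immintrin.h>

// dZcurr[i] = (Z_curr[i] > 0) ? dAcurr[i] : 0.1*dAcurr[i];
// The mask is derived from Z_curr, the scaling is applied to dAcurr.
void relu_backward_avx512(double* __restrict dZcurr, const double* Z_curr,
                          const double* dAcurr, int bo)
{
    int i;
    for (i = 0; i <= bo - 8; i += 8)
    {
        __m512d z  = _mm512_loadu_pd(Z_curr + i);
        __m512d da = _mm512_loadu_pd(dAcurr + i);
        __mmask8 positive = _mm512_cmplt_pd_mask(_mm512_setzero_pd(), z);
        // scale the incoming gradient by 0.1 where z is not positive, pass it through otherwise
        __m512d res = _mm512_mask_mul_pd(da, ~positive, da, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(dZcurr + i, res);
    }
    // remaining elements in scalar loop
    for (; i < bo; ++i) {
        dZcurr[i] = (Z_curr[i] > 0) ? dAcurr[i] : 0.1 * dAcurr[i];
    }
}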