
I have the following simple C function:

#include <math.h>

void calculate_exp(float *out, float *in, int size) {
    for(int i = 0; i < size; i++) {
        out[i] = exp(in[i]);
    }
}

I want to optimize it using OpenMP SIMD. I am new to OpenMP and have tried a few pragmas such as `omp simd` and `omp simd safelen`, but I am unable to get the compiler to generate SIMD code. Can anybody help?

mandar s
  • This doesn't appear to fall within the scope of OpenMP. You would explicitly call a library vector exponentiation function, or use a compiler such as icc which implements a short vector math library. You would want to avoid the mixed data types, e.g. by substituting `expf()` for `exp()`, unless you require the data type promotion. – tim18 Nov 13 '18 at 13:53
  • I wanted the code to run independent of compiler(at least gcc and clang) and independent of architecture(like arm neon or intel sse/avx). – mandar s Nov 14 '18 at 05:10
  • Example `exp_vect_d` is actually standard OpenMP/C code, nothing compiler specific or platform specific. The answer shows that some compilers will generate better code if your arrays happen to be aligned at 32-byte boundaries and if N is a multiple of 8, but you can forget about that if you want compiler/platform independent code. Nevertheless, not all compilers have the same `#pragma omp simd` capabilities. What works with one compiler does not necessarily work with another. – wim Nov 14 '18 at 14:25
  • You did not specify a compiler. GCC and ICC both can vectorize math functions. Clang can do it with `-fveclib` – Z boson Nov 15 '18 at 08:18

1 Answer


You can use one of the following four alternatives to vectorize the exp function. Note that I have used `expf` (the single precision version) instead of `exp`, which is a double precision function. This Godbolt link shows that these functions are vectorized: search for the call to `_ZGVdN8v___expf_finite` in the compiler generated code.

#include <math.h>

int exp_vect_a(float* x, float* y, int N) {
    /* Round N down to a multiple of 8: the last N % 8 elements are not
       processed here, but the compiler can omit the scalar cleanup loop,
       which leads to shorter code. */
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code                  */
    #pragma omp simd
    for(int i=0; i<N; i++) y[i] = expf(x[i]);
    return 0;
}


int exp_vect_b(float* restrict x, float* restrict y, int N) {
    N = N & 0xFFFFFFF8;
    x = (float*)__builtin_assume_aligned(x, 32); /* gcc 8.2 doesn't need aligned x and y  to generate `nice` code */
    y = (float*)__builtin_assume_aligned(y, 32); /* with gcc 7.3 it improves the generated code                   */
    for(int i=0; i<N; i++) y[i] = expf(x[i]);
    return 0; 
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_c(float* restrict x, float* restrict y, int N) {
    for(int i=0; i<N; i++) y[i] = expf(x[i]);
    return 0; 
}

/* This also vectorizes, but it doesn't lead to `nice` code */
int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd             
    for(int i=0; i<N; i++) y[i] = expf(x[i]);
    return 0; 
}

Note that Peter Cordes' comment is very relevant here: the function `_ZGVdN8v___expf_finite` might give slightly different results than `expf`, because its focus is on speed rather than on special cases such as inputs that are infinite, subnormal, or NaN. Moreover, its accuracy is a 4-ulp maximum relative error, which is probably slightly less accurate than the standard `expf` function. Therefore you need optimization level `-Ofast` (which allows less accurate code) instead of `-O3` to get the code vectorized with gcc.

See this libmvec page for further details.

The following test code compiles and runs successfully with gcc 7.3:

#include <math.h>
#include <stdio.h>
/* gcc expv.c -m64 -Ofast -std=c99 -march=skylake -fopenmp -lm */

int exp_vect_d(float* x, float* y, int N) {
    #pragma omp simd             
    for(int i=0; i<N; i++) y[i] = expf(x[i]);
    return 0; 
}

int main(){
    float x[32];
    float y[32];
    int i;
    int N = 32;

    for(i = 0; i < N; i++) x[i] = i/100.0f;
    x[10]=-89.0f;            /* exp(-89.0f)=2.227e-39 which is a subnormal number */
    x[11]=-1000.0f;          /* output: 0.0                                   */
    x[12]=1000.0f;           /* output: Inf.                                  */
    x[13]=0.0f/0.0f;         /* input: NaN: Not a number                      */
    x[14]=1e20f*1e20f;       /* input: Infinity                               */
    x[15]=-1e20f*1e20f;      /* input: -Infinity                              */
    x[16]=2.3025850929940f;  /* exp(2.3025850929940f)=10.0...                 */
    exp_vect_d(x, y, N);
    for(i = 0; i < N; i++) printf("x=%11.8e,  y=%11.8e\n", x[i], y[i]);
    return 0;
}
wim
  • Important to point out that you had to use `-Ofast` (`-O3 -ffast-math`) to enable auto-vectorization of `expf`, and that's why it's directly calling `_ZGVdN8v___expf_finite` which only works for finite non-NaN inputs. With just `-O3`, you get `vmovss` scalar loads/stores. – Peter Cordes Nov 13 '18 at 16:29
  • I thought glibc's `__foo_finite` functions were normally used internally by the regular `foo` after checking the input for special cases. Are they actually different and less accurate, or is the SIMD `expf_finite` simply not precise to 0.5ulp because that would cost too much speed? Or more relevant here, *is* there an 0.5ulp vectorized `expf` anywhere in glibc / libmvec? And is the vectorized version worse than scalar `expf`? – Peter Cordes Nov 13 '18 at 23:45
  • @PeterCordes: Unfortunately, the accuracy of the standard `expf` is not in this [table](https://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html). Indeed the documentation suggests that the vectorized version is worse than the scalar version. I think 0.5ulp would be too expensive for the standard exp function (even a correctly rounded double precision `exp` is not exactly 0.5ulp). I don't know the exact details on glibc's math functions. – wim Nov 14 '18 at 00:21
  • Do you mean promoting to `exp(double)` and then converting the result back to `float` wouldn't give you `0.5` float ULP? (A double ulp is much smaller than a float ulp). That's potentially true, though, because of double rounding of the last calculation to `double`, and then from there to `float`. Anyway yes, 0.5ulp is hard, and 1ulp would be a more reasonable goal. – Peter Cordes Nov 14 '18 at 00:30
  • @PeterCordes My comment was not clear indeed. What I meant is that it is hard to achieve correctly rounded results, illustrated by a not so well chosen example: Function `float expf_accurate(float x){return(float)exp((double)x);}`, with a 1-ulp to 2-ulp double precision `exp` does not necessarily lead to a 0.5ulp single precision `expf_accurate`. Actually this is not a good example, because [there are only four billion floats](https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/), – wim Nov 14 '18 at 14:01
  • (cont'd) so in practice it should be possible to test them all against a correctly rounded library function `expf`. Maybe it passes the test, maybe not. In practice developing correctly rounded math libraries, such as [this one](https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804/file/crlibm.pdf), for example, is indeed much more complex than developing a standard 1-ulp or 2-ulp library. – wim Nov 14 '18 at 14:01
  • Ok, better-than-1ulp was kind of a tangent. I was thinking that glibc scalar math functions actually were 0.5ulp at a large speed cost, but I think you're right that they're not *that* good. Still, the question is whether scalar `expf` is less accurate than scalar `_expf_finite` (non-vectorized `-ffast-math`), and/or vector `_ZGVdN8v___expf_finite`. I thought `expf` and `_expf_finite` gave the same results for finite values (and that scalar `_expf_finite` was actually used internally by `expf`), but I'm not sure and haven't actually checked. – Peter Cordes Nov 14 '18 at 14:07
  • Yes, the question about the accuracy of `expf` vs. `expf_finite` vs. `_ZGVdN8v___expf_finite` is quite interesting. Maybe I'll have time to figure this out later on. – wim Nov 14 '18 at 15:33
  • Since the OP did not specify the compiler your answer would be more complete if you showed how to do this with ICC and Clang as well. ICC should do this with SVML and clang using `-fveclib`. It would be interesting to see what clang does. I have only [seen](https://stackoverflow.com/q/52562272/2542702) `-fveclib=SVML` but maybe other libraries work as well. I did not find much documentation on this. – Z boson Nov 15 '18 at 08:21
  • Apparently there is also `-fveclib=Accelerate` which generates calls to `vexpf` https://godbolt.org/z/iWJDcx I guess it's an Apple thing https://developer.apple.com/documentation/accelerate/simd – Z boson Nov 15 '18 at 08:26
  • Clang should provide an option `-fveclib=libmvec`. – Z boson Nov 15 '18 at 08:33
  • @wim, -Ofast does not guarantee to maintain the IEEE/ISO standard. Also it enables unsafe math optimizations. In my case, that is not acceptable. Can you suggest anything else which maintains the IEEE/ISO standard and still provides vectorization? – mandar s Nov 15 '18 at 11:31
  • The Intel compiler icc seems to be able to generate high accuracy SIMD code, [see here](https://godbolt.org/z/frGjNx), so without -Ofast. Without `-fp-model precise` it generates calls to `__svml_expf8_l9`, which is probably not accurate either, but with `-fp-model precise` `__svml_expf8_ha_l9` is used. Here `ha` means high accuracy. – wim Nov 15 '18 at 22:38
  • @Zboson I'm not very familiar with clang, but it seems to do a quite good job, thanks. – wim Nov 15 '18 at 22:45
  • @wim, note that ICC defaults to a looser floating point model (`-fp-model fast=1`) than GCC or Clang. By default it does not conform with IEEE as it allows associative floating point math. Good observation about `ha`; I was not aware of that. – Z boson Nov 16 '18 at 07:57