
I have a simple math library that gets linked into a project running on simulator hardware (a 32-bit RTOS), where the compiler toolchain is based on a variant of GCC 5.5. The main project code is in Matlab, but the core math operations (cmath functions applied to array data) are rewritten in C for performance. Looking at Compiler Explorer, the optimised code from GCC 5.5 targeting 32-bit does not seem great compared to Clang trunk (32-bit); from what I understand, Clang does a better job of optimising these loops. One example snippet:

...

void cfunctionsLog10(unsigned int n, const double* x, double* y) {
    int i;
    for (i = 0; i < n; i++) {
        y[i] = log10(x[i]);
    }
}

And the corresponding assembly generated by GCC 5.5:

cfunctionsLog10(unsigned int, double const*, double*):
        push    ebp
        push    edi
        push    esi
        push    ebx
        sub     esp, 12
        mov     esi, DWORD PTR [esp+32]
        mov     ebp, DWORD PTR [esp+36]
        mov     edi, DWORD PTR [esp+40]
        test    esi, esi
        je      .L28
        xor     ebx, ebx
.L27:
        sub     esp, 8
        push    DWORD PTR [ebp+4+ebx*8]
        push    DWORD PTR [ebp+0+ebx*8]
        call    __log10_finite
        fstp    QWORD PTR [edi+ebx*8]
        add     ebx, 1
        add     esp, 16
        cmp     ebx, esi
        jne     .L27
.L28:
        add     esp, 12
        pop     ebx
        pop     esi
        pop     edi
        pop     ebp
        ret

whereas Clang produces:

cfunctionsLog10(unsigned int, double const*, double*):              # @cfunctionsLog10(unsigned int, double const*, double*)
        push    ebp
        push    ebx
        push    edi
        push    esi
        sub     esp, 76
        mov     esi, dword ptr [esp + 96]
        test    esi, esi
        je      .LBB2_8
        mov     edi, dword ptr [esp + 104]
        mov     ebx, dword ptr [esp + 100]
        xor     ebp, ebp
        cmp     esi, 4
        jb      .LBB2_7
        lea     eax, [ebx + 8*esi]
        cmp     eax, edi
        jbe     .LBB2_4
        lea     eax, [edi + 8*esi]
        cmp     eax, ebx
        ja      .LBB2_7
.LBB2_4:
        mov     ebp, esi
        xor     esi, esi
        and     ebp, -4
.LBB2_5:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 16] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [ebx + 8*esi + 8] # xmm1 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        vmovsd  qword ptr [esp + 16], xmm1 # 8-byte Spill
        call    log10
        fstp    tbyte ptr [esp + 64]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 16] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    tbyte ptr [esp + 16]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 24] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        call    log10
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        fstp    qword ptr [esp + 56]
        fld     tbyte ptr [esp + 16]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 48]
        fld     tbyte ptr [esp + 64]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 40]
        call    log10
        fstp    qword ptr [esp + 32]
        vmovsd  xmm0, qword ptr [esp + 56] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [esp + 40] # xmm1 = mem[0],zero
        vmovhps xmm0, xmm0, qword ptr [esp + 48] # xmm0 = xmm0[0,1],mem[0,1]
        vmovhps xmm1, xmm1, qword ptr [esp + 32] # xmm1 = xmm1[0,1],mem[0,1]
        vmovups xmmword ptr [edi + 8*esi + 16], xmm1
        vmovups xmmword ptr [edi + 8*esi], xmm0
        add     esi, 4
        cmp     ebp, esi
        jne     .LBB2_5
        mov     esi, dword ptr [esp + 96]
        cmp     ebp, esi
        je      .LBB2_8
.LBB2_7:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*ebp] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    qword ptr [edi + 8*ebp]
        inc     ebp
        cmp     esi, ebp
        jne     .LBB2_7
.LBB2_8:
        add     esp, 76
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

As I am unable to use Clang directly, is there any value in rewriting the C source using AVX intrinsics? I think the majority of the performance cost comes from the cmath function calls, most of which do not have intrinsic implementations.
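
For the few operations that do map directly to a hardware instruction (sqrt, for example, as mentioned in the comments), a plain-intrinsics version is straightforward. A minimal sketch of what I mean, assuming the target supports AVX (`cfunctionsSqrt` is just an illustrative name, not one of the real library functions):

#include <immintrin.h>
#include <math.h>

void cfunctionsSqrt(unsigned int n, const double* x, double* y) {
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);          /* unaligned load of 4 doubles */
        _mm256_storeu_pd(y + i, _mm256_sqrt_pd(v));  /* vsqrtpd + unaligned store   */
    }
    for (; i < n; ++i)                               /* scalar tail                 */
        y[i] = sqrt(x[i]);
}

For log10, tanh, etc. there is no such single instruction, which is why the library calls remain inside the loop.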


EDIT: Reimplemented using the vectorclass library (VCL):

#include "vectorclass.h"      // Vec4d
#include "vectormath_hyp.h"   // vectorised tanh
#include <cmath>

void vclfunctionsTanh(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int VectorSize = 4;
    const int FirstPass = N & (-VectorSize);   // largest multiple of 4 <= N

    int i = 0;
    for (; i < FirstPass; i += VectorSize)
    {
        Vec4d data;
        data.load(x + i);          // one unaligned SIMD load
        Vec4d ans = tanh(data);    // VCL's inline vectorised tanh
        ans.store(y + i);          // one unaligned SIMD store
    }

    for (; i < N; ++i)             // scalar tail (0-3 leftover elements)
        y[i] = std::tanh(x[i]);
}
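
A possible refinement (sketch only, assuming the same VCL headers as above): the final scalar loop could be replaced with VCL's load_partial/store_partial member functions, so the leftover elements also go through the vectorised tanh:

    if (i < N)                                   // 1-3 leftover elements
    {
        Vec4d data;
        data.load_partial(N - i, x + i);         // reads only N - i doubles, zeros the rest
        tanh(data).store_partial(N - i, y + i);  // writes only N - i doubles
    }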

Mansoor
  • First of all, include some actual code + asm you're talking about in your question, not just via Godbolt shortlinks. (And if you're not including everything, use "full" links to Godbolt to guard against bitrot). Also, use a newer GCC for more efficient auto-vectorization of sqrt, and/or `-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store` (see [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](//stackoverflow.com/q/52626726)). Also maybe try `-mfpmath=sse`? The calling convention still passes on the stack and returns in x87 though, unless you recompile libm. – Peter Cordes Jun 11 '20 at 10:37
  • If you just need fast approximations of some of those functions, you can manually vectorize them and inline them with intrinsics; that could give a major speedup. Despite the inefficient mixing of AVX and x87 doing store/reload, most of the cost will be inside the math library functions, except for sqrt. So yes, doing 4 doubles in parallel could be at least 4x faster, or more if you trade some precision for speed. (And/or NaN checking; if you know your inputs are finite that can help a lot especially for SIMD where you can't branch per element) – Peter Cordes Jun 11 '20 at 10:42
  • @PeterCordes I think fast approximations are the way to go if they provide concrete error bounds? – Mansoor Jun 11 '20 at 11:41
  • Yes, depending on how you write them, they certainly can. log and exp are relatively easy to vectorize, and a polynomial approximation for the mantissa converges quickly. e.g. [Efficient implementation of log2(\_\_m256d) in AVX2](https://stackoverflow.com/q/45770089) / [Fastest Implementation of Exponential Function Using AVX](https://stackoverflow.com/q/48863719) and related Q&As. IDK about cos. For `float`, you can simply test every possible 32-bit float bit-pattern to be certain about min/max error. – Peter Cordes Jun 11 '20 at 11:45
  • Also there are SIMD math libraries with existing implementations like [Where is Clang's '\_mm256\_pow\_ps' intrinsic?](https://stackoverflow.com/a/36637424) / [AVX log intrinsics (\_mm256\_log\_ps) missing in g++-4.8?](https://stackoverflow.com/q/18747768) / [Mathematical functions for SIMD registers](https://stackoverflow.com/q/40475140) – Peter Cordes Jun 11 '20 at 11:46
  • What makes you think that rewriting math-functions in C will be faster than Matlab? (Unless, of course you manually implement faster, lower-precision functions) – chtz Jun 11 '20 at 15:07
  • @chtz Matlab's core math functions, when code-generated into C, have bounds checks (and various other checks). – Mansoor Jun 11 '20 at 15:25
  • Your C `cfunctionsTanh` is a different function from the asm you show, calling `cfunctionsLog10`. Also, clang's asm is actually shockingly bad, doing `tbyte` spill/reload of the return value to the stack instead of converting straight into the destination. 10-byte x87 store / reload is much slower than `double`, and it's totally unnecessary. – Peter Cordes Jun 11 '20 at 17:30
  • A collection of functions is normally called a library :P. Some of the libraries linked from [Where is Clang's '\_mm256\_pow\_ps' intrinsic?](https://stackoverflow.com/a/36637424) have some. Apparently https://sleef.org/ has configurable precision; I haven't looked into it. But as far as collections of functions intended for hand-tweaking to apply any special-case assumptions your use case allows (like positive or at least non-negative, or normalized, or limited range), I don't know. – Peter Cordes Jun 11 '20 at 17:51
  • Use a single SIMD load, not `data(x[i], x[i+1], x[i+2], x[i+3])`! Like in [Performance worsens when using SSE (Simple addition of integer arrays)](https://stackoverflow.com/q/24446516). VCL has a wrapper for loads separate from the 4-element constructor which you could use instead of `_mm256_loadu_pd`. – Peter Cordes Jun 21 '20 at 22:21
  • @PeterCordes thanks for the help :). When moving the data back into the C arrays, is there a better approach? – Mansoor Jun 22 '20 at 11:15
  • Yes, use a SIMD store or `storeu`, of course. VCL should have a store member-function. – Peter Cordes Jun 22 '20 at 14:58

1 Answer


The vector class library has inline vector versions of common mathematical functions, including log10.

https://github.com/vectorclass/
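
For example, the log10 loop from the question could look like this (a sketch; `vclfunctionsLog10` is just an illustrative name, and it assumes VCL v2's vectorclass.h / vectormath_exp.h headers and an AVX-capable target):

#include "vectorclass.h"      // Vec4d
#include "vectormath_exp.h"   // vectorised exp/log family, including log10
#include <cmath>

void vclfunctionsLog10(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int FirstPass = N & (-4);   // largest multiple of 4 <= N

    int i = 0;
    for (; i < FirstPass; i += 4)
    {
        Vec4d data;
        data.load(x + i);             // one unaligned SIMD load
        log10(data).store(y + i);     // inline vectorised log10, then store
    }

    for (; i < N; ++i)                // scalar tail
        y[i] = std::log10(x[i]);
}

Because the whole body is inlined, there is no per-element library call and no stack round trip for arguments and return values as in the scalar asm above.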

A Fog
  • The documentation recommends not compiling with `-ffast-math`; is this true for cases where we know that we deal only with finite numbers and operations that don't overflow? – Mansoor Jun 20 '20 at 11:49
  • -ffast-math makes it impossible to detect NAN results in most cases. If you do not want to check for NANs anyway then you can use -ffast-math. – A Fog Jun 21 '20 at 14:58