Can't force inlining C++ function using Intel compiler

Question

I have a function defined as

inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

(__m512d is a native data type that maps to SIMD registers on the Intel MIC architecture)

As this function is rather short and gets invoked frequently, I'd like it to be inlined at every invocation. But Intel's compiler seems reluctant to inline it, even with the -inline-forceinline and -O3 options. It reports 'Forceinline not honored for call ...' while compiling. Because I have to use some compiler-specific features, e.g. the __m512d type, the Intel compiler is my only option.

More Info:

The file structure is quite simple. The function vec_add is defined in a header file mic.h, which is included in another file test.cc. Function vec_add is just invoked repeatedly in a loop, and there're no function pointers involved. A simplified version of the code in test.cc looks like this

for (int i = 0; i < LENGTH; i += 8) {
    // a, b, c are arrays of doubles, and each SIMD register can hold 8 doubles
    __m512d va = _mm512_load_pd(a + i); // load SIMD register from memory
    __m512d vb = _mm512_load_pd(b + i); // ditto
    __m512d vc;
    vec_add(vc, va, vb);
    _mm512_store_pd(c + i, vc); // store SIMD register to memory
}

I've tried all kinds of hints, like __attribute__((always_inline)), __forceinline, and the compiler option -inline-forceinline, none of which has worked so far.

Complete code

I've put all the relevant code together in a simplified form. You can try it out if you have an Intel compiler. Use the option -Winline to view inline reports and -inline-forceinline to force inlining.

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

#define LEN (1<<20)

__attribute((target(mic)))
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

int main() {
    #pragma offload target(mic)
    {
        double *a = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *b = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *c = (double*)_mm_malloc(LEN*sizeof(double), 64);

        for (int i = 0; i < LEN; i++) {
            a[i] = (double)rand()/RAND_MAX;
            b[i] = (double)rand()/RAND_MAX;
        }

        for (int i = 0; i < LEN; i += 8) {
            __m512d va = _mm512_load_pd(a + i);
            __m512d vb = _mm512_load_pd(b + i);
            __m512d vc;
            vec_add(vc, va, vb);
            _mm512_store_pd(c + i, vc);
        }

        _mm_free(a);
        _mm_free(b);
        _mm_free(c);
    }
}

Configurations

Compiler: Intel compiler (ICC) 14.0.2
Compile options: -O3 -inline-forceinline -Winline

Do you have any idea why this function can't be inlined? And how can I get it inlined after all (I don't want to turn to macros)?

Are you perchance taking the address of the function somewhere? — Frédéric Hamidi, May 15 '14 at 09:15
And have you tried using `inline __forceinline void vec_add(...)`? — Massa, May 15 '14 at 09:15
Including the file structure of your project would help (in which file is the function, in which the caller, and which header files are included). — MikeMB, May 15 '14 at 09:19
Have you checked the assembly code if there is really a jump to your function? — MikeMB, May 15 '14 at 09:29
ICC implements a number of GCC extensions - try adding: `__attribute__ ((__always_inline__))` to the function specification. — Brett Hale, May 15 '14 at 09:31
@Angew Yes, the function definition is included from a header file, thus accessible at compile time. — lei_z, May 15 '14 at 09:32
Have you tried simplifying the call site? I mean, just call the function outside of a loop or other nesting. See if the compiler is willing to inline it at least in the most simple setup. — Angew is no longer proud of SO, May 15 '14 at 09:40
@BrettHale Do you mean `__attribute__((always_inline))`? Tried both, not working.. — lei_z, May 15 '14 at 09:41
@MikeMB No I haven't checked the assembly. But I've tried converting this function to a macro, and got a noticeable performance boost. So I'm rather sure the function is not inlined. — lei_z, May 15 '14 at 09:51
what if you mark the function inline static? Otherwise it will be required to have linkage? — paulm, May 15 '14 at 10:42
@lei.april That sounds reasonable, which unfortunately means that I have no idea why the compiler doesn't want to inline the function. However, as you're already using compiler-specific types in the function interfaces, I wonder why you want to put the call to _mm512_add_pd inside a function in the first place? — MikeMB, May 15 '14 at 10:47
@MikeMB Well, that's another story. There're two data types representing SIMD registers, `__m512d` for floating-points and `__m512i` for integers. In my example I only demonstrated the former. Actually I'd like to utilize function overloading to handle these two types with a single function name `vec_add`, and that's why I avoid using macros. — lei_z, May 15 '14 at 11:34
Read Assembly code. The whole discussion is useless unless you provide non-inlined Assembly code. BTW, I hope that you are talking about Release configuration with optimizations turned on. — Alex F, May 15 '14 at 11:56
One thing you could try is this: Pass the arguments by value and return the result via a return statement and see what happens (check again for inlining and execution speed). — MikeMB, May 15 '14 at 11:56
@AlexFarber As mentioned above, `-O3` is used in my configuration. The assembly is a bit overwhelming to me (shame on me..). Please see the code I've just posted and maybe you can compile and disassemble it. — lei_z, May 15 '14 at 12:10
@ Tried, no help. BTW, I've just posted a complete code snippet. You can try other ideas on it. — lei_z, May 15 '14 at 12:17
@lei.april The two important things about your code are the offloading pragma and `__attribute((target(mic)))` directive (see my answer). If you do stuff like this and your compiler doesn't behave as expected, you should really mention it in your question — MikeMB, May 15 '14 at 13:12
@MikeMB Never thought the pragma would cause such problem... The link you found is really helpful. Thanks a lot :) — lei_z, May 15 '14 at 14:29
@lei.april Sorry, maybe my last comment was a little harsh. After all: If we always knew what causes our problems, we would hardly need stackoverflow right? Anyway, glad I could help you. — MikeMB, May 15 '14 at 16:05

MikeMB · Accepted Answer · 2014-05-15T13:08:47.553

For some reason the Intel compiler doesn't inline functions that are called directly inside an offload region (I'm not all that familiar with the concept, so I don't know what the technical reason for this is). See Intel's article "Effective Use of the Intel Compiler's Offload Features" for more information (just search for "inline").

Quoting from the linked article:

Function Inlining into Offload Constructs

Sometimes inlining a function is necessary for optimum performance of the generated code. Functions called directly within a #pragma offload are not inlined by the compiler even if they are marked as inline. To enable optimum performance of code in offload regions, either manually inline functions, or place the entire offload construct into its own function.

...

One solution is to manually inline function f, as shown in function v2.

Another solution is to move the offload construct into its own function as shown in function v3.

If I understand this correctly, the best thing to do for you would be to place the loops into a separate function which is also marked with __attribute((target(mic))).
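
A rough sketch of that restructuring might look like the following. It is not tested on a MIC system, and several parts are assumptions rather than anything from the question: the MIC_TARGET macro and the names add_arrays and offload_add are hypothetical, the in/out data clauses are my guess at what the pointer arguments would need, and vec_add is reduced to a scalar stand-in so that the sketch also compiles with other compilers. With the Intel MIC toolchain the body of vec_add would use __m512d and _mm512_add_pd as in the question.

```cpp
#include <cassert>

// Hypothetical helper macro (not from the question): expands to the Intel
// offload attribute only when the Intel compiler builds with offload support,
// so the sketch still compiles with other compilers.
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
#define MIC_TARGET __attribute__((target(mic)))
#else
#define MIC_TARGET
#endif

// Scalar stand-in for the question's vec_add (assumption made for
// portability; the real version would operate on __m512d values).
MIC_TARGET
inline void vec_add(double &v3, const double &v1, const double &v2) {
    v3 = v1 + v2;
}

// The loop lives in its own function that is also marked for the
// coprocessor. vec_add is now called from here instead of directly
// inside the offload region.
MIC_TARGET
void add_arrays(double *c, const double *a, const double *b, int len) {
    for (int i = 0; i < len; ++i) {
        vec_add(c[i], a[i], b[i]);
    }
}

// Host-side wrapper: the offload region only contains the call to add_arrays.
void offload_add(double *c, const double *a, const double *b, int len) {
    assert(a && b && c && len >= 0);
#if defined(__INTEL_OFFLOAD)
    // Assumed data clauses: copy a and b to the coprocessor, copy c back.
    #pragma offload target(mic) in(a, b : length(len)) out(c : length(len))
#endif
    {
        add_arrays(c, a, b, len);
    }
}
```

Calling offload_add(c, a, b, len) from main then replaces the inline loop of the original program. Whether the compiler actually inlines vec_add into add_arrays is still worth confirming with -Winline or by reading the assembly.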

I presume that this is simply a limitation of the current implementation, and not a design intention. — pburka, May 18 '14 at 13:23
