How to Vectorize using Intel Intrinsics in C programming language?

Question

I want to vectorize a c program.

I searched on the internet, YouTube but found very little (which was not helpful for beginner like me and most of them were about c++). Whatever little I understood, is that I have to use compiler intrinsics (which can be found in Intel Intrinsics Guide). I have an old machine which supports SSE 4.1, SSE 4.2 instruction.

But I can not move forward with the little knowledge I have, so my question is, how can I vectorize a c program?

As a demonstration, can you show how to optimize the following code:

float function(float* Array, int Initial, int Finishing_point)
{
    int k = 0;
    float VL = 0;
    for (int i = Initial; i < Finishing_point; i++)
    {
        k++;
        Vl = Vl + Array[i] * pow(2, k);
    }    
    return Vl;
}

Please note that, I need an introductory example, thus I am using an example that includes summation, array operation and other simple programming.

You probably should focus on learning C first. Vectorizing makes code faster, but you do that when you have good, correct code. When you have that (and not earlier), you can find the bottlenecks in your code, and vectorize only those. But chances are that your compiler will vectorize this code for you, so it won't be that bottleneck. — MSalters, Aug 11 '22 at 14:26
@MSalters thanks for the advice, but I already have the code and don't worry about whether I can detect the bottleneck, in this post I want to know how I can start vectorization in `c`, if you can help, plz do that, thanks. — Michael, Aug 11 '22 at 15:15
If you do have the running code, post a minimal but complete example For instance, we need to know the types of all variables. We don't need the whole program, a single function can show this logic But what you've posted now does not even compile. — MSalters, Aug 11 '22 at 15:27
Start by compiling with `-O3 -march=native` and have a look at what your compiler already does for you. — chtz, Aug 11 '22 at 16:20
@chtz no idea what you are talking about, plz refer to something more elaborate, and please undo close voting if you can. — Michael, Aug 11 '22 at 16:48
@MSalters undo the closing vote, it contains enough information, I need an introductory example that shows how to vectorize code related to array and loop, the main line is `Vl= Vl+Array[i]*pow(2,k); ` you can use whatever you want, if anything little require to be changed it can be done after you post an answer. — Michael, Aug 11 '22 at 16:51
If you don't know about compiler optimization options to see if it's already auto-vectorizing, learn about that first before you consider manually vectorizing with intrinsics. `-O3 -march=native` works for all compilers except MSVC. Optimization and target-arch options are necessary for intrinsics to be useful, too. See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) re: looking at compiler-generated asm. — Peter Cordes, Aug 11 '22 at 17:31
You might also need `-ffast-math` to strength-reduce the `pow` to `power *= 2.0` and then vectorize to `*=16.0` in each of 4 elements. And of course to vectorize the summation into `Vl`, since FP math isn't associative. (Actually, strength-reduction of the pow can be done exactly, since float/double can represent powers of 2 exactly up to their exponent limit. But if any compilers are able to do it in practice, it might only be with -ffast-math, even if you separated it from the reduction that definitely needs fast-math.) — Peter Cordes, Aug 11 '22 at 17:36
Writing C code that can auto-vectorize is not different from C++. Presumably most tutorials would be about raw pointers, or looping over a `std::vector` which looks the same to the compiler. — Peter Cordes, Aug 11 '22 at 17:38
See also [How to vectorize with gcc?](https://stackoverflow.com/q/409300) re: optimization reports where GCC can tell you when it did/didn't vectorize a loop. And [C++ auto-vectorization requirements for gcc, clang and msvc](https://stackoverflow.com/q/59645037) re: msvc — Peter Cordes, Aug 11 '22 at 17:53
Here’s an example, untested: https://godbolt.org/z/PMnesxGbE — Soonts, Aug 12 '22 at 09:59
@Soonts thanks for your code, to see the output I tried `int main() {int A={1,2,3,4,5,6,7,8};printf("%d",computeThings( A, 0, 8 ));return 0;}` but see nothing, would you upvote and reopen the post and put a tested code, please? — Michael, Aug 13 '22 at 05:48
@Michael Your example should not compile. You can't initialize `int` with 8 values, nor pass `int` in place of `float*` pointer. — Soonts, Aug 14 '22 at 09:09
@Soonts: Huh, surprised neither GCC nor clang auto-vectorize this after manually strength-reducing the pow, even with `-O3 -ffast-math`. https://godbolt.org/z/bzao9nzx3 Interestingly, GCC *does* auto-vectorize a `powf(2, k)` version with a call to libmvec `_ZGVbN4v_expf` (exponential function), but's obviously going to be slower than a good manual vectorization. It can't run a huge number of iterations before overflowing the exponent range of a `float`, so out-of-order exec can hide some latency if there's any independent code after the loop, but some unrolling is still good. — Peter Cordes, Aug 14 '22 at 10:48

Soonts · Accepted Answer · 2022-08-14T12:51:06.720

Here’s the manually vectorized function, it requires SSE1 and SSE3.

#include <xmmintrin.h>  // SSE 1
#include <pmmintrin.h>  // SSE 3

float computeThings( const float* rsi, int idxFirst, int idxEnd )
{
    // Figure out the slice of the input array to consume
    size_t count = (size_t)( idxEnd - idxFirst );
    size_t countAligned = ( count / 4 ) * 4;
    rsi += idxFirst;
    const float* endAligned = rsi + countAligned;
    const float* end = rsi + count;

    // Process majority of inputs with SSE
    __m128 acc = _mm_setzero_ps();
    __m128 kexp = _mm_setr_ps( 2, 4, 8, 16 );
    for( ; rsi < endAligned; rsi += 4 )
    {
        __m128 v = _mm_loadu_ps( rsi );
        v = _mm_mul_ps( v, kexp );
        kexp = _mm_mul_ps( kexp, _mm_set1_ps( 16 ) );
        acc = _mm_add_ps( acc, v );
    }

    // Compute horizontal sum of the `acc` vector

    // acc.xyzw += acc.yyww
    acc = _mm_add_ps( acc, _mm_movehdup_ps( acc ) );
    // acc.x += acc.z
    acc = _mm_add_ss( acc, _mm_unpackhi_ps( acc, acc ) );

    // Process the remaining 0-3 numbers
    for( ; rsi < end; rsi++ )
    {
        __m128 v = _mm_load_ss( rsi );
        v = _mm_mul_ss( v, kexp );
        // kexp.x *= 2, computed as kexp.x += kexp.x
        kexp = _mm_add_ss( kexp, kexp );
        acc = _mm_add_ss( acc, v );
    }

    return _mm_cvtss_f32( acc );
}

Usage example:

float A[] = { 1,2,3,4,5,6,7,8 };
printf( "%g", computeThings( A, 0, 8 ) );

The `k` vector should be `2,4,8,16`. Note that the OP's code does `k++` before `pow`, so the first `pow(2,k)` is `2**1`, not `2**0`. Also, I'd use a var name like `kexp` or something, since the scalar code used plain `k` as an integer. And portable code will need to keep a pure C version. (Which could be optimized manually with strength-reduction at least.) — Peter Cordes, Aug 14 '22 at 12:46

How to Vectorize using Intel Intrinsics in C programming language?

1 Answers1