1

I want to vectorize a[i] = a[i-1] +c by AVX2 instructions. It seems its un vectorizable because of the dependencies. I've vectorized and want to share the answer here to see if there is any better answer to this question or my solution is good.

Amiri
  • 2,417
  • 1
  • 15
  • 42
  • 1
    Note that llvm/clang auto-vectorizes this scalar code, you might want to compare your solution to that (and possibly file bug reports for other compilers asking them to implement such an optimization). Also, writing `a[i]=b;b+=c;` is vectorized by gcc (though perhaps not optimally). – Marc Glisse Apr 10 '17 at 19:16
  • Thanks, I will check. Clang -O3 time : `0.000037 s` GCC -O3 time : `0.000107 s` – Amiri Apr 11 '17 at 00:47
  • My solution time is `0.000045 s` so `LLVM` is faster than my solution – Amiri Apr 11 '17 at 00:53
  • @MarcGlisse, I can not file bug report! `User account creation filtered due to spam.` – Amiri Apr 11 '17 at 15:45
  • 1
    The current way to create a user account is to send an email to overseers@gcc.gnu.org (they do it manually). – Marc Glisse Apr 11 '17 at 18:39

1 Answers1

1

I have implemented the following function for vectorizing this and it seems OK! The speedup is 2.5x over gcc -O3 Here is the solution:

// vectorized
inline void vec(int a[LEN], int b, int c)
{
    // b=1 and c=2 in this case
    int i = 0;
    a[i++] = b;//0 --> a[0] = 1
    //step 1:
    //solving dependencies vectorization factor is 8
    a[i++] = a[0] + 1*c; //1  --> a[1] = 1 + 2  = 3
    a[i++] = a[0] + 2*c; //2  --> a[2] = 1 + 4  = 5
    a[i++] = a[0] + 3*c; //3  --> a[3] = 1 + 6  = 7
    a[i++] = a[0] + 4*c; //4  --> a[4] = 1 + 8  = 9
    a[i++] = a[0] + 5*c; //5  --> a[5] = 1 + 10 = 11
    a[i++] = a[0] + 6*c; //6  --> a[6] = 1 + 12 = 13
    a[i++] = a[0] + 7*c; //7  --> a[7] = 1 + 14 = 15
    // vectorization factor reached
    // 8 *c will work for all 
    //loading the results to an vector
    __m256i dep1, dep2; //  dep = { 1,   3,  5, 7,  9,  11, 13, 15 }
    __m256i coeff = _mm256_set1_epi32(8*c); //coeff = { 16, 16, 16, 16, 16, 16, 16, 16 }

    for(; i<LEN-1; i+=16){

        dep1 = _mm256_load_si256((__m256i *) &a[i-8]);
        dep1 = _mm256_add_epi32(dep1, coeff);
        _mm256_store_si256((__m256i *) &a[i], dep1);    

        dep2 = _mm256_load_si256((__m256i *) &a[i]);
        dep2 = _mm256_add_epi32(dep2, coeff);
        _mm256_store_si256((__m256i *) &a[i+8], dep2);  

    }
}
Amiri
  • 2,417
  • 1
  • 15
  • 42