The question is whether the code below (here summing only V1 and V2, though in general there could be N ~ 10 such vectors, which are not contiguous in memory and have a relevant length of 3 each) is the most performant way to do a vectorized accumulation when the values to accumulate are not read from contiguous chunks of memory (the hardware is Intel Skylake).
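
In scalar terms, the computation is a small batch of dot products. A minimal sketch of the reference computation, with hypothetical names (`ix` holding the N row indices, `coeff[X][k]` the scalar coefficient of row k for output X):

    // Reference scalar version (hypothetical names): every output element is
    // a dot product over the N selected, non-contiguous rows of u.
    void accumulate_ref(const double u[][3], const int* ix, int N,
                        const double coeff[3][16], double result[3][3]) {
        for (int X = 0; X < 3; X++)          // which output: A, B or C
            for (int i = 0; i < 3; i++) {    // element within a row of u
                double acc = 0.0;
                for (int k = 0; k < N; k++)  // N non-contiguous rows of u
                    acc += u[ix[k]][i] * coeff[X][k];
                result[X][i] = acc;
            }
    }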

In particular, should it be faster than accumulating each product with, e.g., a horizontal sum instead (i.e. parallelizing along the 'other' dimension)?
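
Parallelizing the 'other' dimension would mean packing the terms of one output element into a vector, multiplying by the coefficients, and reducing with a horizontal sum. A minimal sketch of that alternative, assuming up to 4 terms per output element (`hsum256_pd` and `accumulate_one_element` are local helpers, not standard intrinsics):

    #include <x86intrin.h>

    static inline double hsum256_pd(__m256d v) {
        // horizontal sum of the 4 lanes: add the two 128-bit halves, then the pair
        __m128d lo = _mm256_castpd256_pd128(v);
        __m128d hi = _mm256_extractf128_pd(v, 1);
        lo = _mm_add_pd(lo, hi);                     // [a+c, b+d]
        __m128d swap = _mm_unpackhi_pd(lo, lo);      // [b+d, b+d]
        return _mm_cvtsd_f64(_mm_add_sd(lo, swap));  // (a+c)+(b+d)
    }

    // One output element = a dot product of its terms with their coefficients.
    double accumulate_one_element(const double* terms, const double* coeffs) {
        __m256d t = _mm256_loadu_pd(terms);   // the up-to-4 terms for ONE output element
        __m256d c = _mm256_loadu_pd(coeffs);  // their coefficients
        return hsum256_pd(_mm256_mul_pd(t, c));
    }

Note that in this direction the terms u[ix1][i], u[ix2][i], ... are strided in memory, so they would first have to be gathered into a buffer, and the final reduction is serial shuffle work per output element.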

A minimal working example follows:

#include <x86intrin.h>
#include <iostream>

int main(){

    // Initialization.
    // The innermost dimension is 12 doubles (96 bytes, a multiple of 32),
    // so aligning the base of the array to 32 bytes keeps every 4-double
    // chunk 32-byte aligned and makes the aligned loads below legal.
    alignas(32) double coefficients[50][2][12] = {};
    for (int i=0; i<4; i++) coefficients[0][0][i]   = 1.0; // C1_0 is 1
    for (int i=0; i<4; i++) coefficients[0][0][4+i] = 2.0; // C2_0 is 2
    for (int i=0; i<4; i++) coefficients[0][0][8+i] = 3.0; // C3_0 is 3
    for (int i=0; i<4; i++) coefficients[0][1][i]   = 4.0; // C1_1 is 4
    for (int i=0; i<4; i++) coefficients[0][1][4+i] = 5.0; // C2_1 is 5
    for (int i=0; i<4; i++) coefficients[0][1][8+i] = 6.0; // C3_1 is 6

    int ix1 = 13;
    int ix2 = 27;
    double u[50][3] = {}; // zero-initialized so the over-read below is defined

    u[ix1][0] = 1.0;
    u[ix1][1] = 2.0;
    u[ix1][2] = 3.0;

    u[ix2][0] = 4.0;
    u[ix2][1] = 5.0;
    u[ix2][2] = 6.0;

    // Step 1: load data.
    // Rows of u are 24 bytes apart, so they cannot all be 32-byte aligned;
    // use unaligned loads. Each load also reads one double past the
    // 3-element row (the first element of the next row): that fourth lane
    // is simply unused, and the read stays inside the array for ix < 49.
    __m256d V1 = _mm256_loadu_pd(&u[ix1][0]);
    __m256d V2 = _mm256_loadu_pd(&u[ix2][0]);

    // Populate the vectorised coefficients (32-byte aligned, see above)
    __m256d Vcoeff_A_1 = _mm256_load_pd(&coefficients[0][0][0]);
    __m256d Vcoeff_B_1 = _mm256_load_pd(&coefficients[0][0][4]);
    __m256d Vcoeff_C_1 = _mm256_load_pd(&coefficients[0][0][8]);
    __m256d Vcoeff_A_2 = _mm256_load_pd(&coefficients[0][1][0]);
    __m256d Vcoeff_B_2 = _mm256_load_pd(&coefficients[0][1][4]);
    __m256d Vcoeff_C_2 = _mm256_load_pd(&coefficients[0][1][8]);

    // The accumulation refers to
    //
    // resultA[i] = V1[i] * Vcoeff_A_1[i] + V2[i] * Vcoeff_A_2[i] + ...
    // potentially extended with further vectors V3, V4, V5, ...
    //
    __m256d resultA = _mm256_fmadd_pd(V1, Vcoeff_A_1, _mm256_mul_pd(V2, Vcoeff_A_2));
    __m256d resultB = _mm256_fmadd_pd(V1, Vcoeff_B_1, _mm256_mul_pd(V2, Vcoeff_B_2));
    __m256d resultC = _mm256_fmadd_pd(V1, Vcoeff_C_1, _mm256_mul_pd(V2, Vcoeff_C_2));

    // Store and print the results so the compiler cannot optimize the
    // whole computation away.
    double out[12];
    _mm256_storeu_pd(out,     resultA);
    _mm256_storeu_pd(out + 4, resultB);
    _mm256_storeu_pd(out + 8, resultC);
    for (int i = 0; i < 12; i++) std::cout << out[i] << ' ';
    std::cout << '\n';
}
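
This needs FMA and AVX2 enabled at compile time, e.g. `-mavx2 -mfma` or `-march=skylake` with GCC/Clang (with ICC, e.g. `-xCORE-AVX2`).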

The following is an example of how an accumulation of size 12 would look (it does not compile as-is; the `Vu_*` and `Vcoeff_*_B256` variables are not defined):

   __m256d resultB =
       _mm256_fmadd_pd( Vu_0,  Vcoeff_0_B256,   // 0
       _mm256_fmadd_pd( Vu_1,  Vcoeff_1_B256,   // 1
       _mm256_fmadd_pd( Vu_2,  Vcoeff_2_B256,   // 2
       _mm256_fmadd_pd( Vu_3,  Vcoeff_3_B256,   // 3
       _mm256_fmadd_pd( Vu_4,  Vcoeff_4_B256,   // 4
       _mm256_fmadd_pd( Vu_5,  Vcoeff_5_B256,   // 5
       _mm256_fmadd_pd( Vu_6,  Vcoeff_6_B256,   // 6
       _mm256_fmadd_pd( Vu_7,  Vcoeff_7_B256,   // 7
       _mm256_fmadd_pd( Vu_8,  Vcoeff_8_B256,   // 8
       _mm256_fmadd_pd( Vu_9,  Vcoeff_9_B256,   // 9
       _mm256_fmadd_pd( Vu_10, Vcoeff_10_B256,  // 10
       _mm256_mul_pd(   Vu_11, Vcoeff_11_B256 ) // 11
       )))))))))));
Gaston
  • I'm not familiar with icc, but can it auto-vectorise code? If so, you might just try writing the simplest C++ code you can and then take a look at the object code it produces. – Paul Sanders Jul 08 '22 at 17:34
  • IMHO, since your loops are only 4 iterations, consider unrolling them. All loops have overhead on setup, compare and continue. You can eliminate the overhead by unrolling the loops (essentially removing them). – Thomas Matthews Jul 08 '22 at 17:51
  • Sorry, the initialization section does not count. Initialization happens only once. – Gaston Jul 08 '22 at 17:55
  • I can already tell the alignment is all wrong and that ladder thing at the end means every call depends on the next. That will be horrible code. I bet if you write it in plain C++ it will be faster for some reason you aren't thinking of. More sure if you add alignment. – Goswin von Brederlow Jul 08 '22 at 18:08
  • Could you separate in your minimal example the part which you want to optimize (into a separate method) -- ideally also as non-vectorized code so it is absolutely clear what you want. Are your sizes known at compile time? Do you care for latency or only for throughput? – chtz Jul 08 '22 at 19:51
  • @PaulSanders intel compiler at level O3 reports no speedup for the autoVec case, which is indeed what is measured. – Gaston Jul 09 '22 at 00:11
  • @Gaston OK. Not sure how you're measuring performance (always a tricky business) but can you inspect the object code it generates somehow? – Paul Sanders Jul 09 '22 at 00:15
  • @PaulSanders: https://godbolt.org/ has ICC (classic) and ICX (OneAPI), so we can look at the asm ourselves. (At least after making `resultA` volatile or storing it somewhere so it won't be optimized away. And maybe moving the code into a function that takes args, instead of constants that would let most compilers do constant-propagation. Although I think ICC takes intrinsics so literally that it won't even constprop through `_mm256_mul_pd`) – Peter Cordes Jul 09 '22 at 01:21
  • @PeterCordes Ok, cheers. Not my area, any of this, but at least the OP can now get stuck in and experiment. – Paul Sanders Jul 09 '22 at 08:18
  • @Gaston: It's not great to be wasting 1 of the 4 elements of a `__m256d`, assuming that's how you're looping over `u[50][3]`, just using the low 3 elements of the vector as a whole row? But horizontal stuff within one vector would be even worse, so for a non-tiny number of coefficient constant vectors, this might be a good tradeoff (and something a compiler won't do; they hate inventing unnecessary work in an unused lane when vectorizing). – Peter Cordes Jul 09 '22 at 08:31
  • As long as the number of coefficients is small enough for OoO exec to overlap well, this chain of FMAs is a decent way to do it. Consider breaking into 2 chains that you add at the end for longer ones like 12, doubling the instruction-level parallelism within one "iteration". (See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: longer FMA chains in a dot product, which is basically what this is, or 3 interleaved dot products in parallel in separate elements.) A sketch of that two-chain split follows below. – Peter Cordes Jul 09 '22 at 08:33
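
Building on that last comment, a sketch of splitting the 12-term chain into two independent FMA chains that are added at the end (untested; same hypothetical `Vu_*` / `Vcoeff_*_B256` names as in the 12-term example above). Note this reassociates the floating-point sum, so results can differ in the last bits:

   // Two independent 6-deep FMA chains instead of one 12-deep chain;
   // out-of-order execution can run the two chains in parallel.
   __m256d acc0 = _mm256_mul_pd(Vu_0, Vcoeff_0_B256);  // chain 0: terms 0,2,4,...
   __m256d acc1 = _mm256_mul_pd(Vu_1, Vcoeff_1_B256);  // chain 1: terms 1,3,5,...
   acc0 = _mm256_fmadd_pd(Vu_2,  Vcoeff_2_B256,  acc0);
   acc1 = _mm256_fmadd_pd(Vu_3,  Vcoeff_3_B256,  acc1);
   acc0 = _mm256_fmadd_pd(Vu_4,  Vcoeff_4_B256,  acc0);
   acc1 = _mm256_fmadd_pd(Vu_5,  Vcoeff_5_B256,  acc1);
   acc0 = _mm256_fmadd_pd(Vu_6,  Vcoeff_6_B256,  acc0);
   acc1 = _mm256_fmadd_pd(Vu_7,  Vcoeff_7_B256,  acc1);
   acc0 = _mm256_fmadd_pd(Vu_8,  Vcoeff_8_B256,  acc0);
   acc1 = _mm256_fmadd_pd(Vu_9,  Vcoeff_9_B256,  acc1);
   acc0 = _mm256_fmadd_pd(Vu_10, Vcoeff_10_B256, acc0);
   acc1 = _mm256_fmadd_pd(Vu_11, Vcoeff_11_B256, acc1);
   __m256d resultB = _mm256_add_pd(acc0, acc1);        // join the two chains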

0 Answers