
Recently I have been working on a numerical solver for computational electrodynamics using the finite-difference method.

The solver was very simple to implement, but it is very difficult to approach the theoretical throughput of modern processors, because there is only one math operation per loaded element, for example:

    #pragma ivdep
    for(int ii=0;ii<Large_Number;ii++)
    { Z[ii]  = C1*Z[ii] + C2*D[ii];}

Large_Number is about 1,000,000 and never bigger than 10,000,000.

I have tried manually unrolling the loop and writing AVX code, but failed to make it faster:

int Vec_Size    = 8;
int Unroll_Num  = 6;
int remainder   = Large_Number%(Vec_Size*Unroll_Num);
int iter        = Large_Number/(Vec_Size*Unroll_Num);
int addr_incr   = Vec_Size*Unroll_Num;

__m256 AVX_Div1, AVX_Div2, AVX_Div3, AVX_Div4, AVX_Div5, AVX_Div6;
__m256 AVX_Z1,   AVX_Z2,   AVX_Z3,   AVX_Z4,   AVX_Z5,   AVX_Z6;

__m256 AVX_Zb = _mm256_set1_ps(Zb);
__m256 AVX_Za = _mm256_set1_ps(Za);
for(int it=0;it<iter;it++)
{
    int addr    = it * addr_incr;  // bug fix: was `addr + addr_incr`, reading uninitialized `addr`
    AVX_Div1    = _mm256_loadu_ps(&Div1[addr]);     
    AVX_Z1      = _mm256_loadu_ps(&Z[addr]);
    AVX_Z1      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div1),_mm256_mul_ps(AVX_Za,AVX_Z1));
    _mm256_storeu_ps(&Z[addr],AVX_Z1);

    AVX_Div2    = _mm256_loadu_ps(&Div1[addr+8]);
    AVX_Z2      = _mm256_loadu_ps(&Z[addr+8]);
    AVX_Z2      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div2),_mm256_mul_ps(AVX_Za,AVX_Z2));
    _mm256_storeu_ps(&Z[addr+8],AVX_Z2);

    AVX_Div3    = _mm256_loadu_ps(&Div1[addr+16]);
    AVX_Z3      = _mm256_loadu_ps(&Z[addr+16]);
    AVX_Z3      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div3),_mm256_mul_ps(AVX_Za,AVX_Z3));
    _mm256_storeu_ps(&Z[addr+16],AVX_Z3);

    AVX_Div4    = _mm256_loadu_ps(&Div1[addr+24]);
    AVX_Z4      = _mm256_loadu_ps(&Z[addr+24]);
    AVX_Z4      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div4),_mm256_mul_ps(AVX_Za,AVX_Z4));
    _mm256_storeu_ps(&Z[addr+24],AVX_Z4);

    AVX_Div5    = _mm256_loadu_ps(&Div1[addr+32]);
    AVX_Z5      = _mm256_loadu_ps(&Z[addr+32]);
    AVX_Z5      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div5),_mm256_mul_ps(AVX_Za,AVX_Z5));
    _mm256_storeu_ps(&Z[addr+32],AVX_Z5);

    AVX_Div6    = _mm256_loadu_ps(&Div1[addr+40]);
    AVX_Z6      = _mm256_loadu_ps(&Z[addr+40]);
    AVX_Z6      = _mm256_add_ps(_mm256_mul_ps(AVX_Zb,AVX_Div6),_mm256_mul_ps(AVX_Za,AVX_Z6));
    _mm256_storeu_ps(&Z[addr+40],AVX_Z6);
}
// note: the `remainder` elements still need a separate scalar cleanup loop

The above AVX loop is actually a bit slower than the code generated by the Intel compiler.

The compiler-generated code reaches about 8 GFLOPS, roughly 25% of the single-thread theoretical throughput of a 3 GHz Ivy Bridge processor. I wonder if it is even possible to reach that throughput for a simple loop like this.

Thank you!

PhD AP EcE
  • If your compiler supports it, try the compiler-specific `restrict` keyword on all pointers. If not, write and compile in C99 mode and use the standard `restrict` keyword. Also, align all pointers and tell the compiler that they are aligned. Also, forget about the remainder; ask for the input to be padded instead. – user3528438 May 05 '15 at 00:03
  • Thanks for your comments. All arrays are aligned to 64 bytes using _mm_malloc(size, 64), and #pragma ivdep was used instead of restrict to vectorize the for loop. – PhD AP EcE May 05 '15 at 00:48
  • What compiler and OS are you using? – Z boson May 05 '15 at 07:50
  • I'm seeing a boatload of unaligned loads and stores (`_mm256_loadu_ps`, `_mm256_storeu_ps`). Can the data not be aligned in favor of replacing these intrinsics with the aligned versions (`_mm256_load_ps`, `_mm256_store_ps`)? –  May 05 '15 at 18:32

3 Answers


Improving the performance of code like yours is a well-explored and still popular area. Take a look at dot-product optimization (Z boson already provided a perfect link) or at some (D)AXPY optimization discussions (https://scicomp.stackexchange.com/questions/1932/are-daxpy-dcopy-dscal-overkills).

In general, the key topics to explore and consider applying are:

  • The AVX2 advantage over AVX, thanks to FMA and the better load/store port microarchitecture on Haswell
  • Prefetching; "streaming stores" / "non-temporal stores" on some platforms
  • Thread-level parallelism to reach the maximum sustained bandwidth
  • Unrolling (already done by you; the Intel compiler can also do this with #pragma unroll(X)). Not a big difference for "streaming" codes.
  • Finally, deciding which set of hardware platforms you want to optimize your code for

The last bullet is particularly important, because for "streaming" and generally memory-bound codes it's important to know more about the target memory subsystem; for example, with existing and especially future high-end HPC servers (2nd-gen Xeon Phi, codenamed Knights Landing, as an example) you may have a very different "roofline model" balance between bandwidth and compute, and hence need different techniques than when optimizing for an average desktop machine.
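As a rough illustration of the first two bullets, here is a minimal sketch of an AVX2/FMA version of the kernel (the function name and the no-remainder assumption are mine: it assumes 32-byte-aligned arrays and a length padded to a multiple of 8, as suggested in the comments above):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: Z[i] = Za*Z[i] + Zb*Div1[i] with AVX2/FMA.
   Assumes 32-byte-aligned arrays and n a multiple of 8
   (pad the input instead of handling a remainder). */
__attribute__((target("avx2,fma")))
static void axpy_fma(float *Z, const float *Div1,
                     float Za, float Zb, size_t n)
{
    const __m256 vZa = _mm256_set1_ps(Za);
    const __m256 vZb = _mm256_set1_ps(Zb);
    for (size_t i = 0; i < n; i += 8) {
        __m256 d = _mm256_load_ps(&Div1[i]);   /* aligned load */
        __m256 z = _mm256_load_ps(&Z[i]);
        /* one FMA replaces a separate multiply and add */
        z = _mm256_fmadd_ps(vZb, d, _mm256_mul_ps(vZa, z));
        _mm256_store_ps(&Z[i], z);
    }
}
```

On a memory-bound loop like this the FMA mainly shortens the dependency chain; it will not help much if DRAM bandwidth is the real limit.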

zam
  • Thank you! I am developing a research solver for plasmonic simulation. Because the computation is memory-bandwidth limited, only 8% of the compute power of the 4-core Xeon is utilized. Squeezing out the last bit of bandwidth using all the tips above might improve the utilization to 10% (10 GFLOPS vs. 100 GFLOPS for a 3 GHz Ivy Bridge). Apparently a CPU is not suitable for high-bandwidth computation. – PhD AP EcE May 06 '15 at 16:28
  • I guess this is why Intel decided not to include AVX512 into the latest Skylake processors, there is just not enough memory bandwidth to feed the execution units. – PhD AP EcE May 06 '15 at 16:32
  • I don't know which platforms you target, but generally Xeon-based platforms have better potential for memory-bandwidth-bound codes. Also as I mentioned, future Xeons and especially Xeon Phis should significantly change the balance in favor of memory bw bound workloads (even in presence of 512bits wide SIMD). – zam May 06 '15 at 23:59
  • I didn't see any announcements about "not including AVX512 in Skylakes". Also, there are enough somewhat compute-bound or latency bound workloads, which (after appropriate optimizations) still could benefit from AVX512 even with current memory bandwidth characteristics; so AVX512 will not be in vain anyway. – zam May 07 '15 at 00:02
  • AVX512 is not included in consumer Skylakes to be precise, it is included in the Xeon version. – PhD AP EcE May 07 '15 at 01:46

Are you sure that 8 GFLOPS is about 25% of the maximum throughput of a 3 GHz Ivy Bridge processor? Let's do the calculations.

Every 8 elements require two single-precision AVX multiplications and one AVX addition. An Ivy Bridge core can execute one 8-wide AVX addition and one 8-wide AVX multiplication per cycle. Since the addition depends on the two multiplications, processing 8 elements would take 3 cycles; but because the addition can be overlapped with the next iteration's multiplications, this reduces to 2 cycles per 8 elements. That is 24 flops every 2 cycles, or 12 flops per cycle, so at 3 GHz the maximum theoretical throughput is 36 GFLOPS. The compiler-generated code is therefore reaching about 22% of peak, which matches your 25% estimate.
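The arithmetic can be double-checked with a tiny helper (the name is mine; the 24-flops-per-2-cycles figure follows from three 8-wide AVX operations every two cycles):

```c
/* Peak GFLOPS = clock (GHz) * flops per unrolled step / cycles per step.
   For this loop: 3 GHz * 24 flops / 2 cycles = 36 GFLOPS, so a measured
   8 GFLOPS is roughly 22% of peak. */
static double peak_gflops(double clock_ghz, double flops_per_step,
                          double cycles_per_step)
{
    return clock_ghz * flops_per_step / cycles_per_step;
}
```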

One more thing: unrolling the loop 8 times allows it to be vectorized efficiently. I doubt you'll gain any significant speedup by unrolling this particular loop more than that, especially more than 16 times.

Hadi Brais
  • Thank you for your reply. As you said, if 2 cycles are required for 8 elements with 24 flops, that is actually 12 flops per cycle, which translates into 36 GFLOPS. The 25% estimation is more or less correct. – PhD AP EcE May 05 '15 at 04:13
  • I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplications and 1 addition. Maybe the calculation is memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOPS), that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder if this argument holds and there is indeed no solution to this question. – PhD AP EcE May 05 '15 at 04:44
  • Yes, you're right. This translates into 36 GFLOPS. – Hadi Brais May 05 '15 at 04:50
  • What's the data rate of the DRAM memory channel? – Hadi Brais May 05 '15 at 05:11
  • 25.6GB/s for dual channel DDR3 – PhD AP EcE May 05 '15 at 05:16
  • As you mentioned, using a GPU is better. But still, we don't know for sure that the memory bandwidth is the bottleneck here yet. – Hadi Brais May 05 '15 at 05:22
  • That is right. I was so disappointed about the performance. Hopefully some optimization ninja can show up :=). – PhD AP EcE May 05 '15 at 05:27

I think the real bottleneck is that there are 2 load and 1 store instructions for every 2 multiplications and 1 addition. Maybe the calculation is memory-bandwidth limited. Every element requires transferring 12 bytes of data, and if 2G elements are processed every second (which is 6 GFLOPS), that is 24 GB/s of data transfer, reaching the theoretical bandwidth of Ivy Bridge. I wonder if this argument holds and there is indeed no solution to this problem.

The reason I am answering my own question is the hope that someone can correct me before I give up on the optimization. This simple loop is EXTREMELY important for many scientific solvers; it is the backbone of the finite-element and finite-difference methods. If one cannot even feed one processor because the computation is memory-bandwidth limited, why bother with multicore? A high-bandwidth GPU or a Xeon Phi would be a better solution.
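That said, one concrete way to chase peak DRAM bandwidth is to split the streaming loop across cores, since a single core usually cannot saturate the memory controller by itself. A minimal OpenMP sketch (the function name is mine; compile with -fopenmp, otherwise the pragma is ignored and the loop simply runs serially with the same result):

```c
#include <stddef.h>

/* Z[i] = C1*Z[i] + C2*D[i], statically split across threads so each
   core streams one contiguous chunk. `restrict` promises the compiler
   the arrays do not alias, so each chunk can still be vectorized. */
static void update_threaded(float *restrict Z, const float *restrict D,
                            float C1, float C2, ptrdiff_t n)
{
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < n; i++)
        Z[i] = C1 * Z[i] + C2 * D[i];
}
```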

PhD AP EcE
  • Your loop is memory bandwidth bound. However, that does not mean you're getting the peak memory bandwidth. You need to use multiple cores for this. That's one reason to use multiple cores. See [measuring-memory-bandwidth-from-the-dot-product-of-two-arrays](https://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). – Z boson May 05 '15 at 07:23