Performance of SSE and AVX when both are memory-bandwidth limited

Question

In the code below, I changed `dataLen` and got different relative performance (the figures are total run times, so lower is better):

dataLen = 400 SSE time: 758000 us AVX time: 483000 us (SSE time > AVX time)

dataLen = 2400 SSE time: 4212000 us AVX time: 2636000 us (SSE time > AVX time)

dataLen = 2864 SSE time: 6115000 us AVX time: 6146000 us (SSE time ~= AVX time)

dataLen = 3200 SSE time: 8049000 us AVX time: 9297000 us (SSE time < AVX time)

dataLen = 4000 SSE time: 10170000 us AVX time: 11690000 us (SSE time < AVX time)

Both the SSE and the AVX code boil down to this: buf3[i] += buf1[i]*buf2[i];

#include "testfun.h"
#include <iostream>
#include <chrono>
#include <malloc.h>
#include "immintrin.h"
using namespace std::chrono;

void testfun()
{
int dataLen = 4000; 
int N = 10000000;
float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
for(int i=0; i<dataLen; i++)
{
    buf1[i] = 1;
    buf2[i] = 1;
    buf3[i] = 0;
}
//=========================SSE CODE=====================================
system_clock::time_point SSEStart = system_clock::now();
__m128 p1, p2, p3;

for(int j=0; j<N; j++)
for(int i=0; i<dataLen; i=i+4)
{
    p1 = _mm_load_ps(&buf1[i]);
    p2 = _mm_load_ps(&buf2[i]);
    p3 = _mm_load_ps(&buf3[i]);
    p3 = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
    _mm_store_ps(&buf3[i], p3);
}

microseconds SSEtimeUsed = duration_cast<microseconds>(system_clock::now() - SSEStart);
std::cout << "SSE time used: " << SSEtimeUsed.count() << " us, " <<std::endl;

//=========================AVX CODE=====================================
for(int i=0; i<dataLen; i++) buf3[i] = 0;

system_clock::time_point AVXstart = system_clock::now();
__m256  pp1, pp2, pp3; 

for(int j=0; j<N; j++)
for(int i=0; i<dataLen; i=i+8)
{       
    pp1 = _mm256_load_ps(&buf1[i]);
    pp2 = _mm256_load_ps(&buf2[i]);
    pp3 = _mm256_load_ps(&buf3[i]);
    pp3 = _mm256_add_ps(_mm256_mul_ps(pp1, pp2), pp3);
    _mm256_store_ps(&buf3[i], pp3);

}

microseconds AVXtimeUsed = duration_cast<microseconds>(system_clock::now() - AVXstart);
std::cout << "AVX time used: " << AVXtimeUsed.count() << " us, " <<std::endl;

_aligned_free(buf1);
_aligned_free(buf2);
_aligned_free(buf3);
}
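
One way to take timings like these, sketched here as an assumption rather than as part of the original listing, is a small helper that uses `std::chrono::steady_clock` (which is monotonic) and converts the elapsed time directly to microseconds. The name `timeMicroseconds` is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in microseconds, measured with a monotonic clock.
template <typename F>
long long timeMicroseconds(F&& work)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    work();
    const auto stop = clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Each of the two loop nests could be wrapped in a lambda and passed to this helper, which keeps the SSE and AVX measurements symmetric.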

My CPU is an Intel Xeon E3-1225 v2, which has 32 KB of L1 cache per core (4 cores). This code only uses one core when running, so the L1 cache in use is 32 KB.

buf1, buf2 and buf3 are small enough to sit in the L1 and L2 caches (L2 cache 1 MB). Both SSE and AVX are bandwidth limited, but as dataLen increases, why does AVX need more time than SSE?

Z boson · Accepted Answer · 2013-09-10T06:02:25.063

That's an interesting observation. I was able to reproduce your results. I managed to improve your SSE code speed quite a bit by unrolling the loop (see the code below). Now, with SSE, dataLen=2864 is clearly faster, and for the smaller values it's nearly as fast as AVX. For larger values it's faster still. This is due to the loop-carried dependency in your SSE code (i.e. unrolling the loop increases the instruction-level parallelism (ILP)). I did not try unrolling any further. Unrolling the AVX code did not help.

I don't have a clear answer to your question, though. My hunch is that it's related to ILP and to the fact that AVX processors such as Sandy Bridge can only load two 128-bit words (SSE width) simultaneously, not two 256-bit words. So in the SSE code it can do one SSE addition, one SSE multiplication, two SSE loads, and one SSE store simultaneously. For AVX it can do one AVX load (through two 128-bit loads on ports 2 and 3), one AVX multiplication, one AVX addition, and one 128-bit store (half the AVX width) simultaneously. In other words, although the AVX multiplications and additions do twice as much work as their SSE counterparts, the loads and stores are still 128 bits wide. Maybe this sometimes leads to lower ILP with AVX than with SSE in code dominated by loads and stores?
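
As a rough illustration of the port argument above, the following sketch counts load and store instructions and the minimum L1 transfer cycles needed per 8 floats of `buf3[i] += buf1[i]*buf2[i]`. The per-cycle limits (two 16-byte loads and one 16-byte store) are the model described in this answer and are assumptions of the sketch, not measurements. `KernelCost` and `costPer8Floats` are hypothetical names.

```cpp
#include <cassert>

// Assumed L1 limits per cycle (from the port model above, not measured):
//   loads : two 16-byte loads -> 32 bytes per cycle
//   stores: one 16-byte store -> 16 bytes per cycle
const int kLoadBytesPerCycle  = 32;
const int kStoreBytesPerCycle = 16;

struct KernelCost {
    int loadInstr;    // load instructions per 8 floats
    int storeInstr;   // store instructions per 8 floats
    int loadCycles;   // minimum cycles spent moving loaded bytes
    int storeCycles;  // minimum cycles spent moving stored bytes
};

// Cost of updating 8 floats with vectors of vecBytes bytes
// (16 for SSE, 32 for AVX): three buffers are read, one is written.
inline KernelCost costPer8Floats(int vecBytes)
{
    assert(vecBytes == 16 || vecBytes == 32);
    const int bytesPerBuffer = 8 * static_cast<int>(sizeof(float));
    const int vectorsPerBuffer = bytesPerBuffer / vecBytes;
    KernelCost cost;
    cost.loadInstr   = 3 * vectorsPerBuffer;
    cost.storeInstr  = 1 * vectorsPerBuffer;
    cost.loadCycles  = (3 * bytesPerBuffer) / kLoadBytesPerCycle;
    cost.storeCycles = bytesPerBuffer / kStoreBytesPerCycle;
    return cost;
}
```

Under these assumptions, `costPer8Floats(16)` and `costPer8Floats(32)` report the same minimum transfer cycles; AVX only halves the instruction count. That would explain why AVX need not be faster when loads and stores dominate, but this model alone does not explain AVX being slower.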

For more info on the ports and ILP, see "Haswell, Sandy Bridge, Nehalem ports compared".

__m128 p1, p2, p3, p1_v2, p2_v2, p3_v2;
for(int j=0; j<N; j++)
    for(int i=0; i<dataLen; i+=8)
    {
        p1 = _mm_load_ps(&buf1[i]);
        p1_v2 = _mm_load_ps(&buf1[i+4]);
        p2 = _mm_load_ps(&buf2[i]);
        p2_v2 = _mm_load_ps(&buf2[i+4]);
        p3 = _mm_load_ps(&buf3[i]);
        p3_v2 = _mm_load_ps(&buf3[i+4]);
        p3 = _mm_add_ps(_mm_mul_ps(p1, p2), p3);
        p3_v2 = _mm_add_ps(_mm_mul_ps(p1_v2, p2_v2), p3_v2);
        _mm_store_ps(&buf3[i], p3);
        _mm_store_ps(&buf3[i+4], p3_v2);
    }
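
The unrolled loop above advances by 8 floats per iteration, so as written it relies on `dataLen` being a multiple of 8 (true for every value tested in the question). As a hedged, portable illustration of the same idea without intrinsics, the scalar sketch below performs two independent updates per iteration and adds a tail loop for leftover elements. The function names are hypothetical, and the sketch makes no claim about how a given compiler will vectorize it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain reference loop: one element per iteration.
void multiplyAddPlain(const std::vector<float>& a,
                      const std::vector<float>& b,
                      std::vector<float>& c)
{
    assert(a.size() == c.size() && b.size() == c.size());
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] += a[i] * b[i];
}

// Unrolled by two: each iteration computes two results that do not
// depend on each other, then a tail loop handles an odd last element.
void multiplyAddUnrolled2(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::vector<float>& c)
{
    assert(a.size() == c.size() && b.size() == c.size());
    const std::size_t n = c.size();
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        const float r0 = a[i] * b[i] + c[i];
        const float r1 = a[i + 1] * b[i + 1] + c[i + 1];
        c[i] = r0;
        c[i + 1] = r1;
    }
    for (; i < n; ++i)  // tail: at most one element
        c[i] += a[i] * b[i];
}
```

Comparing the output of both functions on the same input is a cheap way to check that an unrolled variant still touches every element exactly once.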

This program is indeed bounded by memory bandwidth. What's more, in this situation AVX will be slower than SSE: the 256-bit load of AVX is slower than the 128-bit load of SSE. Maybe we can call it a bug in the CPU! – myej, Sep 25 '13 at 10:26

score 1 · Answer 2 · answered Sep 10 '13 at 02:35

I think it's a flaw in the cache system of the Sandy Bridge architecture. I could reproduce the same result on an Ivy Bridge CPU, but not on Haswell CPUs; however, Haswell has the same problem when accessing L3. I think it's a big flaw for AVX. Intel should fix this problem in the next stepping or the next architecture.

N = 1000000
datalen = 2000
SSE time used: 280000 us,
AVX time used: 156000 us,

N = 1000000
datalen = 4000 <- it's still fast on Haswell using L2
SSE time used: 811000 us,
AVX time used: 702000 us,

N = 1000000
datalen = 6000
SSE time used: 1216000 us,
AVX time used: 1076000 us,

N = 1000000
datalen = 8000
SSE time used: 1622000 us,
AVX time used: 1466000 us,

N = 100000  <- reduced
datalen = 20000 <- fit in L2 : 256K / 12 = 21845.3
SSE time used: 405000 us,
AVX time used: 374000 us,

N = 100000  
datalen = 40000 <- need L3
SSE time used: 1185000 us,
AVX time used: 1263000 us,

N = 100000  
datalen = 80000
SSE time used: 2340000 us,
AVX time used: 2527000 us,

answered Sep 10 '13 at 02:35

zupet


Try unrolling the SSE loop once. The dependency will become even bigger. Also, `dataLen = 4000` fits in L1, so L3 should not be an issue. – Z boson Sep 10 '13 at 11:20
Each calculation uses three floating-point values, so the 32768 bytes of L1 can hold 2730 elements of each buffer. – zupet Sep 11 '13 at 05:45
You're right. So it appears the discrepancy happens when going from L1 to L2. – Z boson Sep 11 '13 at 11:50
Your memory is 64-byte aligned, right? It's interesting to see. – user1610743 Feb 01 '14 at 20:56
@Zboson: `dataLen = 4000` = 48 KB of data, larger than the 32 KB L1 data cache – netvope May 06 '14 at 09:27
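
The working-set arithmetic from these comments can be sketched as follows. The 32 KB L1 size comes from the question and the 256 KB L2 size from the Haswell results above. The helper names are hypothetical, and the sketch ignores everything else that competes for cache space.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Three float buffers (buf1, buf2, buf3) are touched on every pass.
const std::size_t kBuffers = 3;

// Total bytes touched by one pass over dataLen elements.
inline std::size_t workingSetBytes(std::size_t dataLen)
{
    // Guard against overflow in the multiplication below.
    assert(dataLen <= std::numeric_limits<std::size_t>::max() / (kBuffers * sizeof(float)));
    return kBuffers * dataLen * sizeof(float);
}

// True if all three buffers fit in a cache of cacheBytes bytes.
inline bool fitsInCache(std::size_t dataLen, std::size_t cacheBytes)
{
    return workingSetBytes(dataLen) <= cacheBytes;
}

// Largest dataLen whose three buffers still fit in cacheBytes.
inline std::size_t maxDataLen(std::size_t cacheBytes)
{
    return cacheBytes / (kBuffers * sizeof(float));
}
```

With a 32768-byte L1, `maxDataLen` gives 2730, so `dataLen = 2400` fits while 2864, 3200 and 4000 do not, which lines up with where the SSE and AVX timings in the question cross over.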
