
Warning: this actually turns out to be caused by parity, not by powers of 2. See the EDIT section.


I have found some code that shows quite strange behaviour.

The code uses a 2D array (size x size). When size is a power of 2, the code runs between 10% and 40% slower, the slowest case being size=32.

I obtained these results with the Intel compiler. If I compile with gcc 5.4, the power-of-2 problem disappears, but the code is 3x slower. Having tested it on different CPUs, I think it should be reproducible enough.

Code:

#define N 10000000

unsigned int tab1[TS][TS];

void simul1() {

  for(int i=0; i<TS; i++)
  for(int j=0; j<TS; j++) {

    if(i > 0)
      tab1[i][j] += tab1[i-1][j];

    if(j > 0)
      tab1[i][j] += tab1[i][j-1];
  }
}

int main() {

  for(int i=0; i<TS; i++)
  for(int j=0; j<TS; j++)
    tab1[j][i] = 0;


  for(int i=0; i<N; i++) {
    tab1[0][0] = 1;
    simul1();
  }

  return tab1[10][10];
}

Compilation:

icc:
        icc main.c -O3 -std=c99 -Wall -DTS=31 -o i31
        icc main.c -O3 -std=c99 -Wall -DTS=32 -o i32
        icc main.c -O3 -std=c99 -Wall -DTS=33 -o i33

gcc:
        gcc main.c -O3 -std=c99 -Wall -DTS=31 -o g31
        gcc main.c -O3 -std=c99 -Wall -DTS=32 -o g32
        gcc main.c -O3 -std=c99 -Wall -DTS=33 -o g33

Results on a Xeon E5:

time ./i31
4.549s

time ./i32
6.557s

time ./i33
5.188s


time ./g31
13.335s

time ./g32
13.669s

time ./g33
14.399s

My questions:

  • Why is there a performance loss when the size of the array is exactly 32?
  • (Optional: Why is icc 3x faster than gcc here?)

EDIT: This is actually due to parity, and only appears from size >= 32 onward. The performance difference between even and odd sizes is consistent, and decreases as size grows. Here is a more detailed benchmark:

[benchmark chart: throughput (millions of elements per second) vs. array size]

(The y axis shows millions of elements processed per second, computed as TS² * N / time / 1,000,000, where time is the measured run time in seconds.)

My CPU has a 32KB L1 cache and 256 KB L2

Saiph
  • Do the times remain consistent if the order of execution is changed? – chux - Reinstate Monica Mar 01 '17 at 15:43
  • For the bonus question: I just checked on godbolt: `icc` and `clang` generate vectorized code, while `gcc` doesn't. – bolov Mar 01 '17 at 15:45
  • How do you even benchmark this? Looks very fishy. I would strongly suspect that gcc -O3 will replace this whole program with `return `. Did you disassemble the optimized code? – Lundin Mar 01 '17 at 15:47
  • Not really. The times vary by about +/- 0.5 s depending on the order and the number of times I try the command. However, 32 is consistently slower. – Saiph Mar 01 '17 at 15:47
  • To reason about the 3x difference between gcc and icc, you can look at the generated assembly, which can be produced by passing the -S flag. – Ajay Brahmakshatriya Mar 01 '17 at 15:48
  • Note that to compare times, only the size of the arrays should change. Use `for(int i=0; i<31; i++) for(int j=0; j<31; j++)`. 32x32 loops should take about 6% longer than 31x31, so the question is why 33x33 is faster than 32x32. – chux - Reinstate Monica Mar 01 '17 at 16:01
  • You should check this thread for answers: http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of – Anty Mar 01 '17 at 16:11
  • "this code slower when the array size is a power of 2?" only tests one power of 2. Running tests at (127, 128, 129), (65535, 65536, 65537), etc. would be illuminating. – chux - Reinstate Monica Mar 01 '17 at 17:05
  • Your comment made me realize it is actually not due to powers of 2, but parity. I had tested other powers of 2, but only those. I'm going to make better measurements and edit the post. – Saiph Mar 01 '17 at 17:18
  • BTW: `tab1[i][j] += tab1[i - 1][j];` causes integer overflow, which is UB. You might want to fix that by using `unsigned tab1[TS][TS];` – chux - Reinstate Monica Mar 01 '17 at 18:17
  • @chux Good idea. I checked; the performance remains the same. – Saiph Mar 02 '17 at 09:43

2 Answers


Why is icc 3x faster than gcc here?

GCC is not able to vectorize the inner loop: it reports that there are dependencies between data refs. Intel's compiler is smart enough to split the inner loop into two independent parts:

for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i-1][j];  // this one is vectorized

for (int j = 1; j < TS; j++)
    tab1[i][j] += tab1[i][j-1];

Attempt 1: Loops reorganization

You might get better performance in GCC by rewriting simul1 into:

void simul1(void)
{
    for (int j = 1; j < TS; j++)
        tab1[0][j] += tab1[0][j-1];

    for (int i = 1; i < TS; i++) {
        for (int j = 0; j < TS; j++)
            tab1[i][j] += tab1[i-1][j];
        for (int j = 1; j < TS; j++)
            tab1[i][j] += tab1[i][j-1];
    }
}

My results with GCC 6.3.0 at -O3 -march=native, TS = 32, on an Intel Core i5 5200U:

Original version:

real    0m21.110s
user    0m21.004s
sys 0m0.000s

Modified:

real    0m8.588s
user    0m8.536s
sys 0m0.000s

Attempt 2: Vectorization of prefix sum

After some research, I found a way to vectorize the second inner loop using vector additions and shifts. The algorithm is presented here.

Version A: 128-bit wide vectors

#include <emmintrin.h>

void simul1(void)
{
    for (int j = 1; j < TS; j++)
        tab1[0][j] += tab1[0][j-1];

    for (int i = 1; i < TS; i++) {
        for (int j = 0; j < TS; j++)
            tab1[i][j] += tab1[i-1][j];

        /* In-register inclusive prefix sum within each 4-int block:
           add the vector shifted left by one int, then by two ints. */
        for (int stride = 0; stride < TS; stride += 4) {
            __m128i v;
            v = _mm_loadu_si128((__m128i*) (tab1[i] + stride));
            v = _mm_add_epi32(v, _mm_slli_si128(v, sizeof(int)));
            v = _mm_add_epi32(v, _mm_slli_si128(v, 2*sizeof(int)));
            _mm_storeu_si128((__m128i*) (tab1[i] + stride), v);
        }

        /* Carry the running total across 4-element blocks. */
        for (int stride = 4; stride < TS; stride += 4)
            for (int j = 0; j < 4; j++)
                tab1[i][stride+j] += tab1[i][stride-1];
    }
}

Result:

real    0m7.541s
user    0m7.496s
sys 0m0.004s

Version B: 256-bit wide vectors

This one is more complicated. Consider an eight-element vector of ints:

V = (a, b, c, d, e, f, g, h)

We can treat it as two packed vectors:

(a, b, c, d), (e, f, g, h)

First, the algorithm performs two independent summations:

(a, b, c, d), (e, f, g, h)
+
(0, a, b, c), (0, e, f, g)
=
(a, a+b, b+c, c+d), (e, e+f, f+g, g+h)
+
(0, 0, a, a+b), (0, 0, e, e+f)
=
(a, a+b, a+b+c, a+b+c+d), (e, e+f, e+f+g, e+f+g+h)

Then it propagates the last element of the first half into each element of the second half, so it finally yields:

(a, a+b, a+b+c, a+b+c+d), (a+b+c+d+e, a+b+c+d+e+f, a+b+c+d+e+f+g, a+b+c+d+e+f+g+h)

I suspect that these intrinsics could be written better, so there is potential for some improvement.

#include <immintrin.h>

void simul1(void)
{
    for (int j = 1; j < TS; j++)
        tab1[0][j] += tab1[0][j-1];

    for (int i = 1; i < TS; i++) {
        for (int j = 0; j < TS; j++)
            tab1[i][j] += tab1[i-1][j];

        for (int stride = 0; stride < TS; stride += 8) {
            __m256i v;
            v = _mm256_loadu_si256((__m256i*) (tab1[i] + stride));
            /* In-lane prefix sums: each 128-bit lane does its own
               shift-and-add passes (slli shifts within lanes). */
            v = _mm256_add_epi32(v, _mm256_slli_si256(v, sizeof(int)));
            v = _mm256_add_epi32(v, _mm256_slli_si256(v, 2*sizeof(int)));

            /* Broadcast the total of the low lane (its element 3)
               into every element of the high lane, then add it in. */
            __m256i t = _mm256_setzero_si256();
            t = _mm256_insertf128_si256(t, 
                _mm_shuffle_epi32(_mm256_castsi256_si128(v), 0xFF), 1);

            v = _mm256_add_epi32(v, t);
            _mm256_storeu_si256((__m256i*) (tab1[i] + stride), v);
        }

        /* Carry the running total across 8-element blocks. */
        for (int stride = 8; stride < TS; stride += 8)
            for (int j = 0; j < 8; j++)
                tab1[i][stride+j] += tab1[i][stride-1];
    }
}

Result (Clang 3.8):

real    0m5.644s
user    0m5.364s
sys 0m0.004s
Grzegorz Szpetkowski
  • Tested, and you're right. My timings with gcc are now around 6 s (on my CPU). It is slightly slower than Intel's, but interestingly it is faster for size=32 than for 31 and 33. Thanks for the very nice and detailed answer. – Saiph Mar 02 '17 at 14:48
  • @Saiph: I found a way to vectorize the second inner loop. It's still WIP, but the timings are even more interesting. I will edit my answer soon. – Grzegorz Szpetkowski Mar 02 '17 at 14:56
  • After your changes, not only does the code run extremely fast (with both gcc and icc), but somehow there is no speed difference depending on size anymore. – Saiph Mar 03 '17 at 16:55

Looks like a classic case of cache contention. The code operates on adjacent matrix rows and columns, which can be painful when the matrix rows align with cache lines, so that elements from different rows map to the same cache lines.

But there's not a lot of data. If a line gets kicked out of the fast L1 cache, it probably still fits in the fairly fast L2 cache. L2 is apparently fast enough for the code that GCC emits, but it can't keep up with the (vectorized) code from ICC.

MSalters
  • I do not know the exact L1 cache sizes of the processor used, but with TS=32/33 (4096 B / 4356 B) the array should easily fit into the L1 cache, and since the space is contiguous there is no need for thrashing with, let's say, an 8-way set-associative cache or similar. – Ctx Mar 01 '17 at 16:25
  • It's not a case of fitting, it's a case of thrashing (which is really a not-fitting case). Say you had 8 cache slots and 3 address bits determined which slot to use. If you have a lot of data activity at addresses that don't change those three bits (several arrays that all exercise the same chunk of cache), then only a fraction of your cache is used and the rest is dormant. Change to prime-number-sized things, or not powers of two, and all of the cache is used. – old_timer Mar 01 '17 at 18:18
  • @old_timer This is possible in theory, but in practice an array uses contiguous space in memory, so these three bits will not stay the same. This is why the address is split into [TAG][SET][OFFSET], to be optimized for the contiguous case. Hence, thrashing should not be an issue here. – Ctx Mar 02 '17 at 01:00