
I have 700 items and I loop through them; for each item I obtain its three attributes and perform some basic calculations. I have implemented this using two techniques:

1) Three 700-element arrays, one array for each of the three attributes. So:

 item0.a = array1[0]
 item0.b = array2[0]
 item0.e = array3[0]

2) One 2100-element array containing data for the three attributes consecutively. So:

 item0.a = array[(0*3)+0]
 item0.b = array[(0*3)+1]
 item0.e = array[(0*3)+2]

Now the three item attributes a, b and e are used together within the loop, so it would make sense that storing them in one array should perform better than the three-array technique (due to spatial locality). However:

  • Three 700-element arrays = 3300 CPU cycles on average for the whole loop
  • One 2100-element array = 3500 CPU cycles on average for the whole loop

Here is the code for the 2100-array technique:

unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array[2100];

//I have left out code for simplicity. You can assume by now the array is populated.

start = __rdtscp(&x);

for(int i=0; i < 700; i++){
    unsigned short j = i * 3;
    unsigned int a = array[j + 0];     
    unsigned int b = array[j + 1];     

    data_for_all_items = data_for_all_items & (a != -1 & b != -1);

    unsigned int e = array[j + 2];    

    c += (a * e);
    d += (b * e);
}

finish = __rdtscp(&y);

and here is the code for the three 700-element arrays technique:

unsigned int x;
unsigned int y;
double c = 0;
double d = 0;
bool data_for_all_items = true;
unsigned long long start = 0;
unsigned long long finish = 0;
unsigned int array1[700];
unsigned int array2[700];
unsigned int array3[700];

//I have left out code for simplicity. You can assume by now the arrays are populated.

start = __rdtscp(&x);

for(int i=0; i < 700; i++){
    unsigned int a= array1[i];     //Array 1
    unsigned int b= array2[i];     //Array 2

    data_for_all_items = data_for_all_items & (a != -1 & b != -1);

    unsigned int e = array3[i];    //Array 3

    c += (a * e);
    d += (b * e);
}

finish = __rdtscp(&y);

Why isn't the technique using the one 2100-element array faster? It should be, because the three attributes are used together for each of the 700 items.

I used MSVC 2012 on Windows 7 64-bit.

Assembly for 3x 700-element array technique:

            start = __rdtscp(&x);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[this]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  
 mov         r8d,8ch  
 mov         r9,rax  
 lea         rdx,[rbx+0Ch]  

            for(int i=0; i < 700; i++){
 sub         rdi,rbx  
                unsigned int a = array1[i];
                unsigned int b = array2[i];

                data_for_all_items = data_for_all_items & (a != -1 & b != -1);
 cmp         dword ptr [rdi+rdx-0Ch],0FFFFFFFFh  
 lea         rdx,[rdx+14h]  
 setne       cl  
 cmp         dword ptr [rdi+rdx-1Ch],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-18h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-10h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdi+rdx-14h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-20h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-1Ch],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-18h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-10h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 cmp         dword ptr [rdx-14h],0FFFFFFFFh  
 setne       al  
 and         cl,al  
 and         r15b,cl  
 dec         r8  
 jne         013F26DA53h  

                unsigned int e = array3[i];

                c += (a * e);
                d += (b * e);
            }


            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx 

Assembly for the 2100-element array technique:

            start = __rdtscp(&x);
 rdtscp  
 lea         r8,[this]  
 shl         rdx,20h  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  

            for(int i=0; i < 700; i++){
 xor         r8d,r8d  
 mov         r10,rax  
                unsigned short j = i*3;
 movzx       ecx,r8w  
 add         cx,cx  
 lea         edx,[rcx+r8]  
                unsigned int a = array[j + 0];
                unsigned int b = array[j + 1];

                data_for_all_items = data_for_all_items & (a != -1 & b != -1);
 movzx       ecx,dx  
 cmp         dword ptr [r9+rcx*4+4],0FFFFFFFFh  
 setne       dl  
 cmp         dword ptr [r9+rcx*4],0FFFFFFFFh  
 setne       al  
 inc         r8d  
 and         dl,al  
 and         r14b,dl  
 cmp         r8d,2BCh  
 jl          013F05DA10h  

                unsigned int e = array[j + 2];

                c += (a * e);
                d += (b * e);
            }


            finish = __rdtscp(&y);
 rdtscp  
 shl         rdx,20h  
 lea         r8,[y]  
 or          rax,rdx  
 mov         dword ptr [r8],ecx  
user997112

  • You're gonna want to loop the test many times. Running one iteration is not enough to filter out all the noise. – Mysticial Apr 18 '14 at 21:13
  • I love the smell of premature optimisation in the morning. – Alan Stokes Apr 18 '14 at 21:30
  • The structure of arrays as opposed to an array of structures is known to result in better vectorisation. Examine the assembly output in both cases - the second one should be vectorised. – Hristo Iliev Apr 18 '14 at 21:33
  • Why would it be faster? Your 3 arrays are contiguous in memory, just like your 2100-item array is. – cppguy Apr 18 '14 at 21:34
  • @Mysticial I have run the loop thousands of times. The results I put are an average. – user997112 Apr 18 '14 at 21:35
  • @cppguy OK then - let's assume array1 begins at address 0, array2 begins at 2800, array3 begins at 5600. When I load the cache line for array1, it only contains 4 useful bytes. The next set of useful bytes is located 2800 bytes away. However, when I have the 2100-element vector, each cache line contains (at least) 12 useful bytes (representing the 3 variables). – user997112 Apr 18 '14 at 21:39
  • Why are you using the non-short-circuiting AND? – Ben Voigt Apr 18 '14 at 21:57
  • The compiler appears to have optimized away the calculation of `c` and `d`, and took out the access to `array3` along with it. So you're really comparing 2100 vs 2 * 700. – harold Apr 18 '14 at 22:12

4 Answers


Edit: Given your assembly code, the second loop is unrolled five times. The unrolled version could run faster on an out-of-order execution CPU such as any modern x86/x86-64 CPU.
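For reference, a manual 5x unrolling of the flag-combining part of the three-array loop might look like the sketch below. The function name and shape are hypothetical; the compiler chose this unroll factor itself in the posted assembly.

```cpp
#include <cassert>

// Sketch only: 5x manual unrolling of the flag-combining work over two
// attribute arrays. 0xFFFFFFFFu is what -1 becomes as unsigned int.
bool all_valid(const unsigned int* a1, const unsigned int* a2, int n) {
    bool ok = true;
    int i = 0;
    for (; i + 5 <= n; i += 5) {  // five logical iterations per pass
        ok &= (a1[i]     != 0xFFFFFFFFu) & (a2[i]     != 0xFFFFFFFFu);
        ok &= (a1[i + 1] != 0xFFFFFFFFu) & (a2[i + 1] != 0xFFFFFFFFu);
        ok &= (a1[i + 2] != 0xFFFFFFFFu) & (a2[i + 2] != 0xFFFFFFFFu);
        ok &= (a1[i + 3] != 0xFFFFFFFFu) & (a2[i + 3] != 0xFFFFFFFFu);
        ok &= (a1[i + 4] != 0xFFFFFFFFu) & (a2[i + 4] != 0xFFFFFFFFu);
    }
    for (; i < n; i++)  // remainder (empty when n is a multiple of 5, as 700 is)
        ok &= (a1[i] != 0xFFFFFFFFu) & (a2[i] != 0xFFFFFFFFu);
    return ok;
}
```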


The second code is vectorisable - two elements of each array could be loaded at each iteration into one XMM register each. Since modern CPUs use SSE for both scalar and vector FP arithmetic, this cuts the number of cycles roughly in half. With an AVX-capable CPU four doubles could be loaded into a YMM register, and therefore the number of cycles should be cut in four.
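A sketch of why the separate-array layout vectorises (using GCC/Clang vector extensions rather than the compiler's actual output, and the integer element type of the question rather than doubles): four contiguous 32-bit elements of one attribute fit in a single XMM-sized vector per load.

```cpp
// Illustrative only: vectorised sum of a[i] * e[i] over one attribute pair.
// Each load grabs four contiguous elements - possible precisely because each
// attribute lives in its own contiguous array. Assumes 16-byte-aligned input
// and n a multiple of 4.
typedef unsigned int uint4 __attribute__((vector_size(16)));

unsigned int dot_u32(const unsigned int* a, const unsigned int* e, int n) {
    uint4 acc = {0, 0, 0, 0};
    const uint4* va = (const uint4*)a;
    const uint4* ve = (const uint4*)e;
    for (int i = 0; i < n / 4; i++)
        acc += va[i] * ve[i];                    // four multiply-adds at once
    return acc[0] + acc[1] + acc[2] + acc[3];    // horizontal sum of the lanes
}
```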

The first loop is not vectorisable along i, since the value of a in iteration i+1 comes from a location 3 elements after the one where the value of a in iteration i comes from. In that case vectorisation requires gathered vector loads, and those are only supported in the AVX2 instruction set.

Using proper data structures is crucial when programming CPUs with vector capabilities. Converting code like your first loop into something like your second loop is 90% of the job that one has to do in order to get good performance on the Intel Xeon Phi, which has very wide vector registers but an awfully slow in-order execution engine.

Hristo Iliev
  • Why doesn't a[i+1] follow a[i] in memory? – user997112 Apr 18 '14 at 21:49
  • Bad notation, read `a[i]` as `a` in iteration `i`. In that case `a[i]` is `array[i*3]` while `a[i+1]` is `array[(i+1)*3] = array[i*3 + 3]`. – Hristo Iliev Apr 18 '14 at 21:51
  • ah ok I got you. so are you saying that if vectorisation wasn't possible- my thinking was correct that the 3x variables would be loaded in to the cache closer. However, because of vectorisation the CPU can load the three arrays simultaneously? Have I got that correct? Surely the three arrays still need to come through the L1 cache though, so we still care about spatial locality? – user997112 Apr 18 '14 at 21:53
  • +1 While I can't verify it atm, it sounds the most reasonable given the VS2012 is the first version of the compiler that has auto-vectorization. – Mysticial Apr 18 '14 at 21:55
  • @user997112, your arrays collectively take 8.2 KiB. They even fit in the L1 data cache. – Hristo Iliev Apr 18 '14 at 21:57
  • I'll post the assembly for the faster version in my Q. – user997112 Apr 18 '14 at 21:57
  • Have posted the assembly in opening Q. – user997112 Apr 18 '14 at 22:03
  • Of course, I should have paid more attention and noticed that you are working with `int` and not with `double`, therefore the number of vector elements is twice as many and the packing/unpacking is what makes the ratio of instruction cycles not close to 4. But this does not change the basic idea. – Hristo Iliev Apr 18 '14 at 22:03
  • Could you please post the assembly code for both versions? It does not appear that the second one is vectorised, therefore the reason must lie somewhere else. – Hristo Iliev Apr 18 '14 at 22:09
  • While the code is not vectorised, the second loop version appears to be 5x unrolled. This saves on increments of the loop counter `r8` and its expansion into the array indexer. – Hristo Iliev Apr 18 '14 at 22:24
  • Harold also noted the compiler has optimized-away the third array access – user997112 Apr 18 '14 at 22:29
  • Technically, one does not need gather to vectorize the single array code, permute operations would also work. (For the compiler to vectorize this, it would also need to recognize that the accumulators `c` and `d` can be replicated and summed at the end.) –  Apr 18 '14 at 22:33
  • It's optimised out in both cases and the arrays fit in the L1 cache, hence it doesn't really make for much of a difference. – Hristo Iliev Apr 18 '14 at 22:34
  • @HristoIliev agreed! I have added a std::cout << c << d << std::endl; to prevent optimization. The 2100-element array runs in 3600 CPU cycles and the 3x 700-element array runs in 3100 CPU cycles. I am still investigating.... – user997112 Apr 19 '14 at 00:25
  • It's definitely the effect of the [loop unrolling](http://en.wikipedia.org/wiki/Loop_unwinding) performed by the compiler. You could try and apply a manual 5x unrolling to the first loop and see how that changes the number of cycles. – Hristo Iliev Apr 19 '14 at 09:00
  • @HristoIliev I did, it made no difference (which was confusing). I am going to start a new question on this as this one is starting to get a little too long. I will post you a link here to the new question. – user997112 Apr 19 '14 at 15:33
  • http://stackoverflow.com/questions/23172567/code-accessing-continuous-l1-cache-memory-slower-than-code-accessing-15x-non-con – user997112 Apr 19 '14 at 16:31

The simple answer is that version 1 is SIMD friendly and version 2 is not. However, it's possible to make version 2, the 2100-element array, SIMD friendly. You need to use a hybrid struct of arrays, aka an array of structs of arrays (AoSoA). You arrange the array like this: aaaa bbbb eeee aaaa bbbb eeee ....

Below is code using GCC's vector extensions to do this. Note that now the 2100-element array code looks almost the same as the 700-element array code, but it uses one array instead of three. And instead of having 700 elements between a, b, and e there are only 12 elements between them.

I did not find an easy solution to convert uint4 to double4 with the GCC vector extensions, and I don't want to spend the time writing intrinsics to do this right now, so I made the accumulators cv and dv unsigned int - but for performance I would not want to be converting uint4 to double4 in a loop anyway.

typedef unsigned int uint4 __attribute__ ((vector_size (16)));
//typedef double double4 __attribute__ ((vector_size (32)));  
uint4 zero = {};
unsigned int array[2100];
uint4 test = -1 + zero;
//double4 cv = {};
//double4 dv = {};
uint4 cv = {};
uint4 dv = {};

uint4* av = (uint4*)&array[0];     
uint4* bv = (uint4*)&array[4];
uint4* ev = (uint4*)&array[8];  

for(int i=0; i < 525; i+=3) { //525 = 2100/4 = 700/4*3
    test = test & ((av[i]!= -1) & (bv[i] != -1));     
    cv += (av[i] * ev[i]);
    dv += (bv[i] * ev[i]);
}
double c = cv[0] + cv[1] + cv[2] + cv[3];
double v = dv[0] + dv[1] + dv[2] + dv[3];
bool data_for_all_items = test[0] & test[1] & test[2] & test[3];
Z boson

The concept of 'spatial locality' is throwing you off a little. Chances are that with both solutions, your processor is doing its best to cache the arrays.

Unfortunately, the version of your code that uses one array also performs some extra index math (the i * 3 on every iteration). This is probably where your extra cycles are being spent.
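As a rough illustration of that extra math (a sketch, not the author's code): the per-iteration `i * 3` in the single-array loop can be strength-reduced to a single addition, and both forms visit exactly the same indices.

```cpp
// Sum of all indices the single-array loop touches, computed two ways.
// Illustrative helper names; not from the original code.

// With a multiply per iteration, as in the question (j = i * 3).
unsigned int sum_indices_mul(int n) {
    unsigned int s = 0;
    for (int i = 0; i < n; i++)
        s += i * 3;                  // multiply every iteration
    return s;
}

// With the multiply strength-reduced to one addition per iteration.
unsigned int sum_indices_add(int n) {
    unsigned int s = 0;
    for (unsigned int j = 0; j < (unsigned int)n * 3; j += 3)
        s += j;                      // one add instead of a multiply
    return s;
}
```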

AudioGL
  • A multiply takes about 4-10 CPU cycles, though - the difference is 1100 CPU cycles.... – user997112 Apr 18 '14 at 21:40
  • Remember that you're running that multiply operation 700 times! – AudioGL Apr 18 '14 at 21:45
  • Indeed. It's important to note that CPUs are very good at recognizing sequential access and will concurrently prefetch *multiple* locations. Splitting it out into 3 arrays isn't going to lower bandwidth. – Cory Nelson Apr 19 '14 at 01:27

Spatial locality is indeed useful, but it's actually helping you much more in the second case (the 3 distinct arrays).

The cache line size is 64 bytes (note that it doesn't divide evenly into 12-byte groups of three attributes), so a single access to a 4- or 8-byte value effectively prefetches the next elements. In addition, keep in mind that the CPU HW prefetcher is likely to go on and prefetch even further elements ahead.

However, when a, b and e are packed together, you're "wasting" this valuable prefetching on elements of the same iteration. When you access a, there's no point in prefetching b and e - the next loads are already going there (and would likely just merge in the CPU with the first load, or wait for it to retrieve the data). In fact, when the arrays are merged, you fetch a new memory line only once per 64/(3*4) ≈ 5.3 iterations. The bad alignment even means that on some iterations you'll have a and maybe b long before you get e; this imbalance is usually bad news.

In reality, since the iterations are independent, your CPU would go ahead and start the second iteration relatively fast, thanks to the combination of loop unrolling (in case it was done) and out-of-order execution (calculating the index for the next set of iterations is simple and has no dependencies on the loads sent by the last ones). However, you would have to run ahead pretty far in order to issue the next load every time, and eventually the finite size of the CPU instruction queues will block you, maybe before reaching the full potential memory bandwidth (number of parallel outstanding loads).

The alternative option, on the other hand, where you have 3 distinct arrays, uses the spatial locality / HW prefetching solely across iterations. On each iteration you'll issue 3 loads, which would fetch a full line once every 64/4 = 16 iterations. The overall data fetched is the same (well, it's the same data), but the timeliness is much better because you fetch ahead for the next 16 iterations instead of 5. The difference becomes even bigger when HW prefetching is involved, because you have 3 streams instead of one, meaning you can issue more prefetches (and look even further ahead).
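The line-coverage arithmetic above can be written out as a small check (assuming the 64-byte lines and 4-byte elements used in this answer; the helper is illustrative):

```cpp
// How many loop iterations one cache line covers for a given layout:
// a line holds line_bytes / elem_bytes elements, and each iteration
// consumes attrs_per_iter of them from that stream.
double iters_per_line(int line_bytes, int elem_bytes, int attrs_per_iter) {
    return (double)(line_bytes / elem_bytes) / attrs_per_iter;
}
```

With three split arrays each stream covers 16 iterations per line; with the interleaved a,b,e layout a line covers only about 5.3 iterations.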

Leeor