From my tests, I get roughly 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.
Fused: 1.096258 seconds
Split: 0.562272 seconds
(Refer to bottom for the full test code.)
Although I'm not 100% sure, I suspect that this is due to a combination of two things:
- Saturation of the load-store buffer for memory disambigutation due to the cache misses from
map[gethash(keys[i])]
.
- An added dependency in the fused loop version.
It's obvious that map[gethash(keys[i])]
will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.
Now let's look at the added dependency. The issue is the ret
variable:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
The ret
variable is needed for address resolution of the the store res[ret] = i;
.
- In the fused loop,
ret
is coming from a sure cache miss.
- In the split loop,
ret
is coming tmp[i]
- which is much faster.
This delay in address resolution of the fused loop case likely causes res[ret] = i
to store to clog up the load-store buffer along with map[gethash(keys[i])]
.
Since the load-store buffer has a fixed size, but you have double the junk in it:
You are only able to overlap the cache misses half as much as before. Thus 2x slow-down.
Suppose if we changed the fused loop to this:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[i] = i; // Change "res" to "i"
ret += map[gethash(keys[i])];
}
return ret;
}
This will break the address resolution dependency.
(Note that it's not the same anymore, but it's just to demonstrate the performance difference.)
Then we get similar timings:
Fused: 0.487477 seconds
Split: 0.574585 seconds
Here's the complete test code:
#define SIZE 67108864
unsigned gethash(int key){
return key & (SIZE - 1);
}
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
int ret = 0;
for(int i = 0; i < n; ++i){
tmp[i] = map[gethash(keys[i])];
}
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += tmp[i];
}
return ret;
}
int main()
{
char *map = (char*)calloc(SIZE,sizeof(char));
int *keys = (int*)calloc(SIZE,sizeof(int));
int *res = (int*)calloc(SIZE,sizeof(int));
int *tmp = (int*)calloc(SIZE,sizeof(int));
if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
printf("Memory allocation failed.\n");
system("pause");
return 1;
}
// Generate Random Data
for (int i = 0; i < SIZE; i++){
keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
}
printf("Start...\n");
double start = omp_get_wtime();
int ret;
ret = check_fused(res,map,SIZE,keys);
// ret = check_split(res,map,SIZE,keys,tmp);
double end = omp_get_wtime();
printf("ret = %d",ret);
printf("\n\nseconds = %f\n",end - start);
system("pause");
}