
The code without fission looks like this:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])];
    }
    return ret;
}

With fission:

int check(int * res, char * map, int n, int * keys, int * tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

Notes:

  • The bottleneck is `map[hash(keys[i])]`, which accesses memory randomly.

  • Normally the loop body would be `if(tmp[i]) res[ret++] = i;`. To avoid the branch, I'm using `ret += tmp[i]` instead.

  • `map[..]` is always 0 or 1.

The fission version is usually significantly faster, and I am trying to explain why. My best guess is that `ret += map[..]` still introduces a dependency, and that this dependency prevents speculative execution.

I would like to hear if anyone has a better explanation.

user16367

  • Thought I'd mention this. Although this question looks very similar to [this question](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops), it does not look like a duplicate. – Mysticial Jun 20 '12 at 16:12
  • I'm finally able to build a test case that reproduces your results... Now to see what I can make of it. – Mysticial Jun 20 '12 at 16:35
  • @Mysticial you should be able to see that usually the fission code is much faster. It's only slower or just as fast when `map` is not very big, e.g. when `map`, `keys` and `res` all fit in the cache – user16367 Jun 20 '12 at 16:49
  • I'm testing it with `n = 67108864` and `map` also with a size of `67108864`. I'm getting almost 2x difference in speed. – Mysticial Jun 20 '12 at 16:50

2 Answers


From my tests, I get roughly 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.

Fused: 1.096258 seconds
Split: 0.562272 seconds

(Refer to bottom for the full test code.)


Although I'm not 100% sure, I suspect that this is due to a combination of two things:

  1. Saturation of the load-store buffer for memory disambiguation due to the cache misses from map[gethash(keys[i])].
  2. An added dependency in the fused loop version.

It's obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.

Now let's look at the added dependency. The issue is the ret variable:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

The ret variable is needed for address resolution of the store res[ret] = i;.

  • In the fused loop, ret comes from a near-certain cache miss.
  • In the split loop, ret comes from tmp[i] - which is much faster.

In the fused loop, this delay in address resolution likely causes the store res[ret] = i to clog up the load-store buffer alongside the loads from map[gethash(keys[i])].

Since the load-store buffer has a fixed size, and you now have double the junk in it, you can only overlap the cache misses half as much as before. Hence the 2x slow-down.


Now suppose we change the fused loop to this:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[i] = i;    //  Change "res[ret]" to "res[i]"
        ret += map[gethash(keys[i])];
    }
    return ret;
}

This will break the address resolution dependency.

(Note that the code is no longer equivalent; it's just to demonstrate the performance difference.)

Then we get similar timings:

Fused: 0.487477 seconds
Split: 0.574585 seconds

Here's the complete test code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define SIZE 67108864

unsigned gethash(int key){
    return key & (SIZE - 1);
}

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[gethash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}


int main()
{
    char *map = (char*)calloc(SIZE,sizeof(char));
    int *keys =  (int*)calloc(SIZE,sizeof(int));
    int *res  =  (int*)calloc(SIZE,sizeof(int));
    int *tmp  =  (int*)calloc(SIZE,sizeof(int));
    if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
        printf("Memory allocation failed.\n");
        system("pause");
        return 1;
    }

    //  Generate Random Data
    for (int i = 0; i < SIZE; i++){
        keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
    }

    printf("Start...\n");

    double start = omp_get_wtime();
    int ret;

    ret = check_fused(res,map,SIZE,keys);
//    ret = check_split(res,map,SIZE,keys,tmp);

    double end = omp_get_wtime();

    printf("ret = %d",ret);
    printf("\n\nseconds = %f\n",end - start);

    system("pause");
}
Mysticial
  • This is a valuable analysis, thank you. So, first map[hash(key)] gets put in the load queue. Now, I am not sure what happens next. Is the CPU going to put res[ret] in the store queue with the old ret value and later re-execute it, causing the slow-down? Or is it going to wait for the load and put the correct res[ret] in the store queue? – user16367 Jun 21 '12 at 14:07
  • 1
    That's a low level detail that I'm unsure of (and quite possibly proprietary to Intel). It's definitely not caused by misprediction of `ret`. (The timings are the same even when `ret` is always `0` or `1`). So I suspect that the latter is closer. Perhaps it can't go into the store buffer until the address is known and disambiguated - thus backing up the entire instruction re-order buffer. – Mysticial Jun 21 '12 at 17:05

I don't think it's the array indexing; rather, the call to the function hash() may cause a pipeline stall and prevent optimal instruction reordering.

Ben Voigt
  • map[..] returns 0 or 1 so access to res is sequential. Can you elaborate why hash would cause a stall? hash is actually a #define but even if it was a function, why would it stall without fission? – user16367 Jun 20 '12 at 16:20
  • Also, the calls through `map[]` and `hash()` may evict enough of the cache that the accesses through `res[]` all miss. The second loop after fission would then have a significantly better hit rate. But that's likely somewhat subjective, depending on how big `n` really is. Small cases may not see much improvement at all. – twalberg Jun 20 '12 at 16:29
  • n would be between 500 and 1000 in my case, so that keys and res fit in the cache. map is usually big and does not completely fit in the cache. – user16367 Jun 20 '12 at 16:48