
I have just read a blog post here and tried to do something similar. Here is my code to check what happens in examples 1 and 2:

void doSomething(long numLoop, int cacheSize) {
    long k;
    int arr[1000000];                    /* ~4 MB of ints on the stack */
    for (k = 0; k < numLoop; k++) {
        int i;
        /* touch every cacheSize-th element */
        for (i = 0; i < 1000000; i += cacheSize)
            arr[i] = arr[i];
    }
}

As stated in the blog post, the execution times of doSomething(1000,2) and doSomething(1000,1) should be almost the same, but I got 2.1 s and 4.3 s respectively. Can anyone help me explain this? Thank you.
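
For reference, a minimal harness to reproduce these timings could look like the following sketch (the use of clock() and the exact calls are illustrative assumptions, not necessarily how the numbers above were measured):

#include <stdio.h>
#include <time.h>

void doSomething(long numLoop, int cacheSize);   /* the function above */

int main(void) {
    clock_t start = clock();
    doSomething(1000, 1);                        /* or doSomething(1000, 2) */
    printf("%.2f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}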

Update 1: I have just increased my array to 100 times its original size:

#include <stdlib.h>

void doSomething(long numLoop, int cacheSize) {
    long k;
    int *buffer = malloc(100000000 * sizeof(int));   /* ~400 MB */
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];
    }
    free(buffer);
}

Unfortunately, the execution times of doSomething(10,2) and doSomething(10,1) are still very different: 3.02 s and 5.65 s. Can anyone test this on their machine?
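
For anyone who wants to reproduce this, a small driver that takes the stride from the command line makes both cases easy to time (the wrapper, file name, and binary name below are illustrative, not the exact code used):

#include <stdlib.h>

void doSomething(long numLoop, int cacheSize);   /* the function above */

int main(int argc, char **argv) {
    int stride = (argc > 1) ? atoi(argv[1]) : 1;
    doSomething(10, stride);
    return 0;
}

Built and run, for example, as:

gcc dos.c -o dos
time ./dos 1
time ./dos 2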

boh
  • Are you using the exact same processor? Different processors have different cache sizes. – Mysticial Sep 27 '12 at 00:53
  • Try making the array bigger and see what happens. In the linked post he is using 64 MB, while you are using only 4 MB (1M ints), so you may possibly never get a direct memory access. – Keith Nicholas Sep 27 '12 at 00:57
  • Hmm, you are correct, but I was expecting to see the difference between the L1, L2, and L3 caches. Let me check with a larger array. – boh Sep 27 '12 at 01:18
  • Please check my update 1. – boh Sep 27 '12 at 01:32
  • Use valgrind --tool=cachegrind ./yourprogram and check the Dr and Dw counts; then you will know whether the CPU cache is the main reason for the difference (see the sketch after these comments). – MYMNeo Sep 27 '12 at 01:50
  • @MYMNeo, from what I read in valgrind's output, the miss counts of the two function calls are the same (which suggests the execution times should be the same), but the real execution times are different. I am not going deeper, as Ben Jackson's solution worked for me. Thank you :) – boh Sep 27 '12 at 02:17
  • @navie, see this to learn more about caches: http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of – MYMNeo Sep 27 '12 at 02:41
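
A sketch of the cachegrind runs MYMNeo suggests (the binary name ./dos and the stride arguments are placeholders):

valgrind --tool=cachegrind ./dos 1
valgrind --tool=cachegrind ./dos 2
cg_annotate cachegrind.out.<pid>   # <pid> is filled in by valgrind; compare the Dr/Dw columns per run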

2 Answers


With doSomething(1000, 2) you are executing the inner loop half as many times as with doSomething(1000, 1).

Your inner loop increments by cacheSize, so a value of 2 gives you only half the iterations of a value of 1: with cacheSize = 1 the inner loop runs 1,000,000 times per outer iteration, with cacheSize = 2 only 500,000 times. The array is around 4 MB (1,000,000 four-byte ints).

Actually, I am a bit surprised that a good compiler does not simply optimize this loop away, since arr[i] = arr[i] changes nothing. That might be part of the issue.
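
One way to check whether the compiler kept the loop is to look at the generated assembly (the file name dos.c is a placeholder; the flags are illustrative):

gcc -O2 -S dos.c -o dos.s
grep -n -A 20 'doSomething' dos.s   # see whether the inner loop survived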

Richard Chambers

Your array size of 4 MB is not big enough. The entire array fits in the cache (and is in the cache after the first k loop), so the timing is dominated by instruction execution. If you make arr much bigger than the cache size, you will start to see the expected effect.

(You will see an additional effect when you make arr bigger than the cache: runtime should increase linearly with the arr size until you exceed the cache, at which point you will see a knee in performance; it will suddenly get worse, and runtime will increase on a new, steeper linear scale.)
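
A sketch of how to see that knee yourself, doubling the working set and printing the time per access (the size range, access count, and use of clock() are illustrative choices):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t size;
    /* double the working set from 64 KB up to 256 MB */
    for (size = (size_t)1 << 16; size <= (size_t)1 << 28; size <<= 1) {
        size_t n = size / sizeof(int);
        volatile int *buf = malloc(size);       /* left uninitialized, as in the question */
        size_t i, touched = 0;
        clock_t start = clock();
        while (touched < ((size_t)1 << 27)) {   /* fixed total number of accesses */
            for (i = 0; i < n; i++)
                buf[i] = buf[i];
            touched += n;
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%9zu KB: %.3f ns/access\n", size / 1024, secs * 1e9 / (double)touched);
        free((void *)buf);
    }
    return 0;
}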

Edit: I tried your second version with the following changes:

  1. Change to volatile int *buffer to ensure buffer[i] = buffer[i] is not optimized away.
  2. Compile with -O2 to ensure the loop is optimized sufficiently to prevent loop overhead from dominating.
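
Reconstructed, the modified test might look like this (the main wrapper and the loop count of 10 are assumptions based on the description above and the transcripts below):

#include <stdlib.h>

void doSomething(long numLoop, int cacheSize) {
    long k;
    volatile int *buffer = malloc(100000000 * sizeof(int));  /* ~400 MB */
    for (k = 0; k < numLoop; k++) {
        int i;
        for (i = 0; i < 100000000; i += cacheSize)
            buffer[i] = buffer[i];               /* volatile keeps the load and store */
    }
    free((void *)buffer);
}

int main(int argc, char **argv) {
    doSomething(10, argc > 1 ? atoi(argv[1]) : 1);   /* stride from argv[1] */
    return 0;
}

compiled with gcc -O2 dos.c -o dos.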

When I try that I get almost identical times:

kronos /tmp $ time ./dos 2
./dos 2  1.65s user 0.29s system 99% cpu 1.947 total
kronos /tmp $ time ./dos 1
./dos 1  1.68s user 0.25s system 99% cpu 1.926 total

Here you can see the effect of making the stride two full cache lines: with 4-byte ints and 64-byte cache lines, a stride of 16 still touches every line, while a stride of 32 skips every other line and roughly halves the memory traffic:

kronos /tmp $ time ./dos 16
./dos 16  1.65s user 0.28s system 99% cpu 1.926 total
kronos /tmp $ time ./dos 32
./dos 32  1.06s user 0.30s system 99% cpu 1.356 total
Ben Jackson