
In C, I reduced the total number of loop iterations by nearly a factor of 3, but when I measured the execution time I found almost no improvement. I tested all optimization levels (O0, O1, O2 and O3), and the results are essentially the same. I suspected a compiler problem, but I want to know what actually causes this situation, and what I can do to make the results meet my expectations.

The code is as follows:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define Len 10000000

// Two variables that count the number of loops
int count1 = 0;
int count2 = 0;

int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));
    
    // An array to increase the index,
    // the range of its elements is 1-256
    int rand_arr[128];
    for (int i = 0; i < 128; ++i)
        rand_arr[i] = random()%256+1;
    
    // A random text, the range of its elements is 0-127
    char *tex = malloc((sizeof *tex) * Len);
    for (int i = 0; i < Len; ++i)
        tex[i] = random()%128;
    
    // The first testing
    clock_t start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]])
        count1++;
    printf("No.1: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    // The second testing (x3)
    start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]]+256)
        count2++;
    printf("No.2: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    printf("count1: %d\n", count1);
    printf("count2: %d\n", count2);
    
    return 0;
}

The printed results (averaged over runs) are as follows:

No.1: 0.002213 s
No.2: 0.002209 s
count1: 72661
count2: 25417
Linke
  • related / followup question: [Why does adding an if(!memcmp()) speed up a loop that makes random short strides through a huge byte array?](https://stackoverflow.com/q/69213272) - adding memcmp which may be acting as a prefetch makes it faster, vs. here we have an extra offset to the random stride presumably making HW prefetch do worse or at least be less effective (since the access pattern has less locality). – Peter Cordes Sep 20 '21 at 15:41

2 Answers


The problem comes from the processor itself and not the compiler. This is a complex problem related to the behaviour of the CPU caches, CPU prefetching units and the random access pattern.

Both loops read the tex array at values of i that cannot easily be predicted ahead of time by the processor, because of the random increments stored in rand_arr. Because tex is relatively big, it will likely not fit entirely in the L1 cache (nor in an intermediate L2 cache, if any) but rather in the last-level cache (LLC) or even in RAM. As a result, parts of tex need to be reloaded in each loop from the LLC or from RAM, and the latency of both is relatively high nowadays. The thing is that the second loop is harder to predict and less cache-friendly than the first, even though its number of iterations is smaller!

On x86 CPUs, caches store values in blocks of 64 bytes called cache lines. When a value is fetched from main memory or from another cache (typically due to a cache miss), a full cache line is fetched. Subsequent accesses to the same cache line are faster because the CPU does not need to fetch it again (as long as the cache line has not been invalidated).

In the first loop, the average increment of i is about 128 (because the mean of rand_arr is roughly 128). This means that the average stride between two fetched items of tex is 128; in the worst case, the stride is 256. In the second loop, the average stride is 256+128=384, and in the worst case it is 256+256=512. When the stride is smaller than 64, there is a high probability that the next item is already in a fetched cache line in the first loop, while this never happens in the second loop. Moreover, the prefetching units can prefetch cache lines when several accesses are contiguous or close to each other. This enables the processor to fetch most items of the tex array ahead of time in the first loop. Meanwhile, in the second loop the prefetcher will likely fail to recognize any cache-line access pattern. The prefetching units will probably not prefetch anything (because it would be too expensive to do so), and the result is many cache misses with a high latency that cannot be mitigated, because the accesses are inherently sequential and unpredictable. If the prefetching units did decide to prefetch all the cache lines, then the second loop would not be faster than the first (because both loops would be bound by the memory hierarchy).

Note that random and srandom are not standard functions. Also, be aware that clock is not always precise on all platforms: it has a precision of about 1 ms on my Windows machine (using GCC and MinGW), and similar granularity can be seen on some Linux systems.

Jérôme Richard
  • In the followup question to this, where tex[] is large enough to measure a real effect, I think the wider load from memcmp is acting as SW prefetching to make it more common to read contiguous lines and trigger the L2 streamer prefetcher. See my answer on [Why does adding an if(!memcmp()) speed up a loop that makes random short strides through a huge byte array?](https://stackoverflow.com/a/69216525) – Peter Cordes Sep 17 '21 at 00:44
  • Interesting! Note that I tried doing some naive prefetching by forcing the processor to load the next cache line that will likely be loaded but it was not significantly faster (typically due to the latency of the L3/RAM). I finally tried to prefetch big chunks of 8192KB before working on it and it was about *2.5 faster* (reaching a memory throughput of 22 GB/s). I think it is not faster because AFAIK one core cannot saturate my RAM bandwidth. Using prefetching threads may help to reach 40 GB/s on my machine and should result in a *4 times faster overall execution* compared to the initial code. – Jérôme Richard Sep 17 '21 at 11:07
  • On a typical Intel "client" CPU like my quad-core Skylake, a single core can come pretty close to saturating DRAM bandwidth, like 90% at least IIRC. (Very much unlike a many-core Xeon, especially Skylake-Xeon which has even worse per-core bandwidth but better aggregate bandwidth than Broadwell-Xeon. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020)) – Peter Cordes Sep 17 '21 at 11:16
  • Neat idea. The trick here might be building a prefetch that stays in sync with the actual dep chain's random advancement, but exposing plenty of memory-level parallelism by not just doing one extra access per step of the dep chain. Like maybe loading the next 10 cache lines every 2 or 4 steps. (12 total LFBs on Skylake, so up to 12 outstanding cache lines of L1d misses), hopefully triggering the L2 streamer to stay active from contiguous access from L1, and page-crossing to start early page walks. Either demand load with a dummy `volatile char*`, or `_mm_prefetch`. – Peter Cordes Sep 17 '21 at 11:18

There are a couple of things that could be causing this:

  1. If you're timing your code, you should have:

    startTime = clock();
    CODE
    endTime = clock();
    

Then print your results / do your analysis afterwards. You do some math on the result and use the printf function, which is horrifically inefficient as far as timing goes. Also, the cast to double is not necessary and could be causing the vast majority of your timing, as double math is horrifically slow. Stick to int, as this is potentially 1000x faster.

  2. Odd for-loop code: the standard form of a for loop is:

    for (int i = 0; i < length; i++)
        Code
    

You have

    for (int i = 0; i < length; code)
        i++;

which is odd as far as syntax goes, and may be affecting your timing.

  3. clock(): this may be affecting your timing. If clock() returns a double, I would suggest doing it a different way, with a function that returns int or unsigned int, as doubles will destroy your timing as stated above. If you're worried about this, I'd suggest testing it as follows:

    startTime = clock();
    for (int i = 0; i < 10000; i++)
        dummy = clock();
    endTime = clock();
    totalTime = (endTime - startTime) / 10000;
    
  4. The for loop itself: this can be the main source of your base timing (although it's unlikely, especially since you don't seem to be doing any particularly complicated math). You can fix this by using the #pragma unroll preprocessor directive, which will basically copy and paste all iterations of your for loop into your code, removing its timing effects.

  • You're not going to get subnormal `double` values here, so it won't be anywhere near 1000x slower than int. Maybe at worst 5x, but its cost will overlap with other things like loop overhead because this is running on a modern x86-64 CPU with out-of-order exec. – Peter Cordes Sep 16 '21 at 21:38
  • Also, `printf` isn't inside the timed region, so it's not clear what your point is. `clock()` needs to be evaluated before the rest of the expression involving it, thanks to order of operations, and there's no sane transformation that would result in any work being done first. It certainly has to be evaluated before it can be passed as an arg to `printf`. So if you're repeating your suggestion from comments under the question that `foo=clock(); printf("...", foo)` is different from `printf("...", clock())`, that's definitely wrong. Check the compiler-generated asm, which is what truly runs. – Peter Cordes Sep 17 '21 at 09:54
  • It's only in the followup question ([Why does adding an if(!memcmp()) speed up a loop that makes random short strides through a huge byte array?](https://stackoverflow.com/q/69213272)) that there's conditional printing inside the timed region, but the condition is never true so all that forces is keeping loop variables in call-preserved registers. (Because the compiler doesn't *know* it's going to never actually call). – Peter Cordes Sep 17 '21 at 09:55
  • @PeterCordes how do you know the implementation OP is running this on? He did not mention it. I could be off on how much it affects timing but I'm not wrong that it unnecessarily affects his timing – Mike Stallone Sep 17 '21 at 15:30
  • And yea it does make sense that clock() evaluates before the printf. It's not good practice for this sort of test, however. Especially when that bad practice leads you to cause the compiler to potentially do math and casts before evaluating said clock() – Mike Stallone Sep 17 '21 at 15:33
    I know from the followup that they tested clang on MacOS and GCC on Linux, both on [x86-64] from the tags. But I don't need to know that to know that capturing values to named variables or not doesn't make a difference to the actual dataflow graph that the compiler optimizes. `clock() - start` is an integer calculation, the *result* of which is being cast to `double`. Until that subtraction is done, there's nothing to cast. It's not bad practice to put the ending `clock()` in the same statement as some calculation that uses it. – Peter Cordes Sep 17 '21 at 21:34
  • It might actually be legal to transform `(double)(x - y)` into `(double)x - (double)y` for int, because `double` can exactly represent every `int` value so it's still exact with no rounding error. (For IEEE binary64 `double` and 32-bit `int`). No real-world compiler would actually de-optimize that way, though, so it's not something we have to worry about. But even if it did, one FP conversion before `clock()` is trivial compared to the cost of calling `clock()` itself, or the whole loop inside the timed region, especially on an out-of-order superscalar CPU like all modern x86-64. – Peter Cordes Sep 17 '21 at 21:38
  • I'd agree with you if there was any realistic chance of having extra code inside the timed region, but if you look at asm output from mainstream x86-64 compilers, that doesn't happen. (https://godbolt.org/z/PPPb11de7 shows gcc -O0 and -O2.) **If a compiler wanted to transform and do a cast before the 2nd call to `clock`, it could still do that if you wrote separate statements**. The as-if rule isn't limited to within one statement. (Unless you compile with `-O0` optimization disabled, which is generally a bad idea for benchmarking.) – Peter Cordes Sep 17 '21 at 21:46
    @PeterCordes it is absolutely bad practice to put clock() in the same statement as math, and I beg you to think of any different scenario than the one you're currently looking at. What if the statement was instead `clock() - (x + y)`? The statement inside parentheses will most likely be analyzed first. And if the cast isn't analyzed first, how do you know the load of startTime, or its negation, isn't done first? It just seems very unnecessary why you're arguing with me about this, and saying things like "it wouldn't affect things THAT much". If it affects timing, it's bad – Mike Stallone Sep 20 '21 at 15:20
  • But fair enough on the casting. I just don't see why this advanced compiler knowledge needs to be a pre-requisite for writing a basic timing function. Why not just completely save the possibility that your timing is being adversely affected by unnecessary statements by just writing things in a way where you can be absolutely certain nothing is affecting your call to clock()? – Mike Stallone Sep 20 '21 at 15:22
  • `clock()` is not a cheap function; a couple instructions being before or after it is close to irrelevant. It's not like you're inlining `_mm_lfence(); time = __rdtsc()`, and even that has like 20 cycles of overhead so it's pretty hard to measure a single instruction or two. (But yes not impossible). Part of the reason I'm arguing is that these effects aren't part of the explanation for this question's timing, and these effects are insignificant for anything that takes long enough for `clock()` to be a viable way to measure it. – Peter Cordes Sep 20 '21 at 15:33
  • Equally importantly, separate local variables and statements does *not* correlate with what the compiler puts in the timed region or not when optimizing. Compilers transform your code into a dataflow graph then figure out how to implement that in asm, and they know that locals whose address doesn't "escape" the current function won't be modified by calls to functions like clock. https://godbolt.org/z/qdnWx56M3 shows that compilers just do whatever they want, with not much connection to how the source is written. – Peter Cordes Sep 20 '21 at 15:35
  • We're not timing `-O0` code here (or at least we shouldn't be because that's usually not useful or interesting); only in that case does each C statement compile to a separate block of asm, with the blocks in source order. So basically I'm arguing because these points aren't real problems, and not something that makes a benchmark better. If you aren't writing the asm by hand, all you can do is not leave independent work that the compiler might want to sneak into the timed region. `printf("...", start - clock())` is fine. – Peter Cordes Sep 20 '21 at 15:38
  • If the actual question had used `start - clock() + x` that would be very slightly different, and at least worth talking about, but they didn't. And as I said, `t = clock(); start - t + x;` wouldn't be different when optimization is enabled, which you need to do to meaningfully benchmark your code. – Peter Cordes Sep 20 '21 at 15:39
  • I thought about this some more. Your suggestions don't make things *worse*, but they're not helpful either, and promote a misunderstanding of how things work. Especially the suggestion that printf is part of the timed region is clearly impossible. If you were just suggesting barely-plausible but unlikely to matter stuff like your `clock() - (x + y)` example, I might not have downvoted, but this answer has multiple truly misleading arguments. – Peter Cordes Sep 20 '21 at 17:18