
The following C code should simply execute the same assembly code p times; that assembly should do nothing but decrement the rcx register in sixteen iterations, from 16 down to 0.

When p is small, the program completes quickly, but when p is high (say p = 16), its execution time increases exponentially.

#include <stdio.h>
#include <stdlib.h>

int main() {
    int p = 16;
    int i;
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    return 0;
}

Strangely enough, when I add some lines to measure the execution time, the program completes as fast as expected, with no exponential increase at all:

#include <stdio.h>
#include <stdlib.h>
#include <time.h> //added

int main() {
    int p = 16;
    int i;
    clock_t start, end; //added
    start = clock(); //added
    for(i=0; i<p; i++) { 
        int c = 16;
        __asm__(
            "mov %[c], %%rcx \n"
            "loop: \n" 
                "sub $1, %%rcx \n"
                "jnz loop \n"
            : 
            : [c]"m" (c)
            : "rcx"
        );
    }
    end = clock(); //added
    float time = (float)(end - start)/CLOCKS_PER_SEC; //added
    printf("Time spent: %f\n", time); //added
    return 0;
}

How can I avoid such an issue?

Peter Cordes
ran
  • Presumably the compiler unrolled the small loops. Consult generated assembly and/or compiler options. Also, adding code changes alignment. – Jester Apr 03 '21 at 23:32
  • Can you tell us some concrete timing numbers, such as "with p=16 the first version runs in 40 ms and the second version runs in 50 ms" and so on for p=160, 1600, 16000, etc.? And tell us what compiler and what compiler options you used. – John Zwinck Apr 03 '21 at 23:33
  • What exact compiler / version and options? What exact CPU model? (e.g. i7-6700k). Did you warm up the CPU first to avoid CPU-frequency variation? 16 * 16 iterations is still absolutely tiny, like 256 clock cycles (or 64 nanoseconds on a 4GHz CPU), should be lost in the noise of measurement overhead for the whole process, so IDK how you think you have meaningful times. (16 iterations is small enough for the inner-loop exit to predict correctly, at least on my Skylake, if the outer loop runs enough iterations for it to learn that pattern.) – Peter Cordes Apr 03 '21 at 23:41
  • Also, show the disassembly (of the whole `main`) for the loops on a version that's unexpectedly slow. You might be running into weird code-alignment effects like the `sub` and `jnz` being split across a 64-byte boundary so (IIRC) they can't macro-fuse, and will have to get fetched separately from 2 different uop-cache lines on Skylake-family (with loop buffer disabled by a microcode update). Or from L1i cache with the [JCC erratum microcode workaround](https://stackoverflow.com/a/61016915/224132) – Peter Cordes Apr 03 '21 at 23:45
  • Also related: [Idiomatic way of performance evaluation?](https://stackoverflow.com/a/60293070) for benchmarking pitfalls like dynamic CPU frequency. – Peter Cordes Apr 03 '21 at 23:46
  • The bug Nate identified wouldn't cause *exponential* increases; if `p` is getting loaded into the high half of RCX, it should be quadratic. Like `p * p <<32`. (With the `c=16` in the low half of RCX being negligible.) Anyway, single-step with a debugger and look at register values; it will be easy to confirm what happened. – Peter Cordes Apr 04 '21 at 00:17

1 Answer


You have `mov %[c], %%rcx`, but `c` is only an `int`: the 64-bit `mov` reads 8 bytes from a 4-byte object, so the upper half of RCX is filled with whatever happens to sit next to `c` in memory. If those next four bytes are nonzero, your asm loop will execute many billions of iterations instead of just 16.

Change `c` to `long int` (or `int64_t` for portability to systems where `long` isn't 64-bit), or use `mov %[c], %%ecx` to zero-extend into RCX, or `movsxd %[c], %%rcx` to sign-extend.
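
For instance, here's a minimal sketch of the zero-extension option (my own example, not code from the question): writing ECX clears the upper 32 bits of RCX, so the loop count is exactly the value stored in the int.

#include <stdio.h>

int main(void) {
    int c = 16;
    __asm__(
        "mov %[c], %%ecx \n"    /* 32-bit load zero-extends into RCX */
        "0: \n"
        "    sub $1, %%rcx \n"  /* count down from 16 to 0 */
        "    jnz 0b \n"
        :
        : [c] "m" (c)
        : "rcx", "cc");         /* clobbers: RCX and the flags */
    printf("done\n");
    return 0;
}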

Actually, there's no particular need to load RCX from memory at all; let the compiler do it for you by creating an input/output operand with the `"c"` constraint. Starting an asm template with a `mov` is inefficient.

        unsigned long c = 16;
        __asm__ volatile(
            "0: \n" 
                "sub $1, %%rcx \n"
                "jnz 0b \n"
            : "+c" (c));  // "c" forces the compiler to pick RCX

Note that `volatile` is needed now, since the asm now has an output operand which isn't used afterward, so the compiler might otherwise optimize away the whole block. (This would also have been an issue with your original code, except that there's a special exception for asm statements with no output operands at all. I tend not to like relying on this exception, as it can be hard to remember exactly when it applies, and easy to accidentally change the code so that it no longer does. Use `volatile` whenever deleting the asm block would be unacceptable.)
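
As a quick illustration of the difference (my example, not part of the answer's code): an asm statement whose only output is never read may be deleted entirely at -O2 unless it is marked volatile.

#include <stdio.h>

int main(void) {
    long x = 0, y = 0;

    /* No volatile: the only effect the compiler can see is writing x,
       and x is never used afterwards, so this statement may be removed. */
    __asm__("add $1, %0" : "+r" (x) : : "cc");

    /* volatile: the compiler must assume a side effect it cannot see,
       so this statement survives even though y is unused. */
    __asm__ volatile("add $1, %0" : "+r" (y) : : "cc");

    printf("compiled\n");
    return 0;
}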

I've also used a local label (`0:` / `0b`) so that the code will still assemble properly if the compiler decides to unroll the loop: a named label like `loop:` would then be emitted, and defined, more than once.

Instead of hardcoding RCX, you could use a `"+r"` constraint and write `dec %[c]` in the loop, letting the compiler choose your count register. Note that with `int c` it would have substituted a 32-bit register name such as EAX or ECX, not RCX.
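
One way that variant might look (my adaptation of the suggestion above, assuming GCC/Clang on x86-64): the compiler substitutes whichever 64-bit register it picked for c wherever %[c] appears in the template.

        unsigned long c = 16;
        __asm__ volatile(
            "0: \n"
            "    dec %[c] \n"    /* decrement the compiler-chosen register */
            "    jnz 0b \n"
            : [c] "+r" (c)
            :
            : "cc");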

Nate Eldredge
  • "On the other hand, since you haven't marked your `asm` as `volatile`, the compiler is at liberty to remove it altogether, since it can't tell that it has any effect on the remaining code." But this asm has no outputs, and [GCC's manual](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) says "`asm` statements that have no output operands and `asm goto` statements, are implicitly volatile." – Joseph Sible-Reinstate Monica Apr 04 '21 at 01:28
  • @JosephSible-ReinstateMonica: You're right, that applies to my modified version but not to the original. I've edited. – Nate Eldredge Apr 04 '21 at 01:38