Clang optimizes out RDTSC asm blocks thinking the repeated block yields the same as the previous block. Is this legal?

Question

Suppose we have several repetitions of the same asm statement containing RDTSC, such as:

    volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);

    volatile size_t tick2;    
    asm ( "rdtsc\n"          // clang's optimizer just thinks this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just loads
          "or %%rdx, %q0"    // the result to qword ptr [rsp + 8]
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");

    printf("tick2 - tick1 diff : %zu cycles\n", tick2 - tick1);
    printf("CPU Clock Speed    : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);

Clang++'s optimizer (even at `-O1`) assumes those two asm blocks yield the same value:

tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz

tick1              : bd806adf8b2
this_thread::sleep_for(1s)
tick2              : bd806adf8b2

With Clang's optimizer turned off, the second block yields advancing ticks as expected:

tick2 - tick1 diff : 2900160778 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 14ab6ab3391c
this_thread::sleep_for(1s)
tick2              : 14ac17902a26

At first, GCC's g++ "seems" not to be affected by this:

tick2 - tick1 diff : 2900226898 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 20e40010d8a8
this_thread::sleep_for(1s)
tick2              : 20e4aceecbfa

However, let's add tick3 with exactly the same asm right after tick2:

    volatile size_t tick1;
    asm ( "rdtsc\n"           // Returns the time in EDX:EAX.
          "shl $32, %%rdx\n"  // Shift the upper bits left.
          "or %%rdx, %q0"     // 'Or' in the lower bits.
          : "=a" (tick1)
          : 
          : "rdx");
    
    this_thread::sleep_for(1s);

    volatile size_t tick2;    
    asm ( "rdtsc\n"          // clang's optimizer just thinks this asm yields
          "shl $32, %%rdx\n" // the same bits as above, so it just loads
          "or %%rdx, %q0"    // the result to qword ptr [rsp + 8]
          : "=a" (tick2)     // 
          :                  //   mov     qword ptr [rsp + 8], rbx
          : "rdx");

    volatile size_t tick3;
    asm ( "rdtsc\n"          
          "shl $32, %%rdx\n"   
          "or %%rdx, %q0"    
          : "=a" (tick3)
          : 
          : "rdx");

It turns out that GCC assumes tick3's asm must produce the same value as tick2's because there are "obviously" no external side effects, so it just reuses the value computed for tick2. Even though the result is wrong, GCC has a very strong point.

tick2 - tick1 diff : 2900209182 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 5670bd15088e
this_thread::sleep_for(1s)
tick2              : 567169f2b6ac
tick3              : 567169f2b6ac

In C mode, the optimizers of both GCC and Clang are affected by this.
In other words, even at -O1 both optimize away the repeated asm blocks containing rdtsc:

tick2 - tick1 diff : 0 cycles
CPU Clock Speed    : 0.00 GHz

tick1              : 324ab8f5dd2a
thrd_sleep(&(struct timespec){.tv_sec=1}, nullptr)
tick2              : 324ab8f5dd2a
tick3_rdx          : 324b65d3368c

It turns out that all optimizers can do common-subexpression elimination on identical non-volatile asm statements, so an asm statement for RDTSC needs to be volatile.
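A minimal sketch of that fix, assuming x86-64 and GCC/Clang extended asm. The function name `read_tsc` is made up for illustration; the only change from the asm in the question is the `volatile` qualifier.

```cpp
#include <cstdint>

// Hypothetical helper (x86-64, GCC/Clang only): read the time-stamp counter.
// `volatile` tells the compiler that the result can differ on every execution,
// so identical statements are not merged and an unused result is not dropped.
inline std::uint64_t read_tsc() {
    std::uint64_t ticks;
    asm volatile("rdtsc\n\t"           // EDX:EAX = time-stamp counter
                 "shl $32, %%rdx\n\t"  // move the high half into bits 63..32
                 "or %%rdx, %q0"       // merge it into RAX
                 : "=a"(ticks)
                 :
                 : "rdx");
    return ticks;
}
```

With this version, two back-to-back calls such as `auto t2 = read_tsc(); auto t3 = read_tsc();` should each execute their own `rdtsc`, since the compiler may no longer treat the two statements as interchangeable.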

*It turns out that all optimizers make wrong assumption about RDTSC.* - That phrasing makes it sound like the compiler's fault. Compilers don't look inside the asm template string; that's why it's up to the programmer to write constraints that accurately describe the asm statement. If it can produce different outputs when run twice in a row, you must tell the compiler it's `volatile`. It's not "a wrong assumption about `rdtsc`", it's an assumption about a non-`volatile` asm statement that happens to contain `rdtsc`. – Peter Cordes, Aug 16 '23 at 17:20
@PeterCordes Please feel free to rephrase that. However, I'm afraid that if I rephrase away from that perspective, it might confuse some readers. That's all. – sandthorn, Aug 16 '23 at 17:42
I rephrased it for you. Compilers make assumptions about `asm` statements, so if you want to use `asm` to read `rdtsc`, you need `asm volatile`. – Peter Cordes, Aug 16 '23 at 17:46

David Grayson · Accepted Answer · 2023-08-15T07:04:55.790

9

Inline assembly is not covered by the C++ standard, so I'm not sure what your definition of "legal" is here. The behavior you are seeing makes sense to me though, because you are running inline assembly for its side effects (i.e. your assembly doesn't implement a pure function) and you forgot to use the volatile keyword. From the GCC inline assembly documentation:

The typical use of extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations.

Also:

GCC's optimizers sometimes discard asm statements if they determine there is no need for the output variables. Also, the optimizers may move code out of loops if they believe that the code will always return the same result (i.e. none of its input values change between calls). Using the volatile qualifier disables these optimizations.

If you insert the volatile keyword immediately after asm the problem goes away.

P.S. Instead of using inline assembly, just include `x86intrin.h` and then use the `__rdtsc()` function.
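For example, a small helper built on the intrinsic might look like this. It is only a sketch: `ticks_while_sleeping` is a hypothetical name, and it assumes GCC or Clang targeting x86, where `x86intrin.h` declares `__rdtsc()`.

```cpp
#include <x86intrin.h>  // declares __rdtsc() on GCC and Clang (x86 targets only)
#include <chrono>
#include <thread>

// Hypothetical helper: TSC ticks that elapse while this thread sleeps for `d`.
inline unsigned long long ticks_while_sleeping(std::chrono::milliseconds d) {
    const unsigned long long start = __rdtsc();  // first read
    std::this_thread::sleep_for(d);
    const unsigned long long end = __rdtsc();    // second read, not merged with the first
    return end - start;
}
```

Dividing the result of a one-second sleep by 1e9 gives a rough TSC frequency in GHz, as in the question. Nothing here serializes execution, so the figure is approximate.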

edited Aug 15 '23 at 07:04

answered Aug 15 '23 at 06:44


Should we file a bugzilla issue for the inconsistency of GCC between C and C++ mode? – sandthorn Aug 15 '23 at 06:55

@sandthorn Why file a bug report? There is no bug. You forgot `volatile` so the compiler was allowed to optimize as it did. Also, C and C++ are different languages, you can't expect similar behaviour between them. – Jesper Juhl Aug 15 '23 at 07:14
Please clarify what you mean by "implement a pure function." (a) If I write it in a function (without `volatile`), `gcc` at first seems to recognize `RDTSC` and inline it correctly, but when I call it twice in a row, it still wrongly reuses the old result, while `clang` always reuses the old result. [[LIVE](https://compiler-explorer.com/z/ss5P983ec)] (b) If I write it as an `extern "C"` function, it works, but then nobody can inline it at all. [[LIVE](https://compiler-explorer.com/z/6n5vvT81d)] So after all, I have to add `volatile` anyway? [[LIVE](https://compiler-explorer.com/z/ss5P983ec)] – sandthorn Aug 15 '23 at 09:54

I'm talking about a pure function in the mathematical sense: no side effects, and the value of the output operands at the end are entirely determined by the input operands and nothing else. If you need more clarification on that you'd have to dive into the GCC documentation probably. – David Grayson Aug 15 '23 at 15:04
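To illustrate the distinction, here is a sketch of an asm statement that is pure in this sense (x86, GCC/Clang syntax; `byte_swap` is a hypothetical name). Its output depends only on its input operand, so a plain non-`volatile` `asm` describes it accurately, and the compiler is free to merge or drop repeated uses.

```cpp
#include <cstdint>

// Hypothetical helper (x86, GCC/Clang only): reverse the byte order of x.
// The "+r" operand is both the only input and the only output, and the
// instruction has no other effects, so no `volatile` is needed here.
inline std::uint32_t byte_swap(std::uint32_t x) {
    asm("bswap %0" : "+r"(x));
    return x;
}
```

`rdtsc` is the opposite case: it has no input operands, yet its result changes on every execution, which is why it needs `volatile`.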

@sandthorn: You could file a missed-optimization bug about the fact that G++ *doesn't* reuse the same result of a non-`volatile` asm statement (which it can assume depends only on the inputs, but there are no inputs so it can assume it returns a constant). It seems to miss the optimization even when the variables are non-`volatile`, and you were already using `-O3`. It probably compares the text of the asm statement, so you'd have to wrap it in a function to get this optimization; it has to be the *same* asm statement, not another one with the same template string (instructions). Not a bug. – Peter Cordes Aug 15 '23 at 16:41

@sandthorn: It's not C vs. C++ that seems to make a difference for GCC, it's `std::this_thread::sleep_for(1s)` vs. `sleep(1)` from `unistd.h`. Even compiling `sleep(1)` as C++, G++ is still able to reuse the result from two different asm statements that have the same template and constraints. https://godbolt.org/z/q7j77qrxe (the compiler used `-xc` to compile as C, the executor I left as C++) So maybe it is a missed-optimization bug at least when the local vars are non-`volatile`. – Peter Cordes Aug 15 '23 at 17:27
@PeterCordes In the original question, I mistakenly stated that `g++` optimized correctly, but that's not true. If `tick2` and `tick3` use the same assembly and are adjacent to each other, the result is reloaded!!! [[LIVE](https://compiler-explorer.com/z/K81Tv9Ybq)] The GCC documentation even gives this case as an example in the `volatile` section. I'm afraid this is intended behaviour of the implementation. – sandthorn Aug 16 '23 at 13:26

@sandthorn: They use different operand constraints and different asm template strings so it's not surprising GCC treats them as different. In fact it would be really bad if GCC assumed that *different* asm statements would produce the same output. – Peter Cordes Aug 16 '23 at 14:50
@PeterCordes I have edited my question to correct my statement in the original question. Please edit as you think it is needed. – sandthorn Aug 16 '23 at 17:19
In fact `rdtsc` is used as an explicit example in the documentation. – Nate Eldredge Aug 19 '23 at 02:32

score 3 · Answer 2 · answered Aug 15 '23 at 14:03

Update

Thanks to @DavidGrayson for a great answer.

TL;DR

Simply put, nothing beats the granularity of intrinsics.
GCC's optimizer has every right (per its documentation) to make assumptions about asm, even ones that sometimes turn out to be wrong.
Optimizers have no right to make assumptions about volatile asm blocks, except that they may move them.
That is, optimizers are free to move even volatile asm blocks, so two consecutive volatile asm blocks may be compiled into non-consecutive code.

    auto tick1  = __rdtsc();
    
    this_thread::sleep_for(1s);

    auto tick2  = __rdtsc();
    auto tick3  = __rdtsc();

    printf("tick2 - tick1 diff : %llu cycles\n", tick2 - tick1);
    printf("CPU Clock Speed    : %.2f GHz\n\n", (double) (tick2 - tick1) / 1'000'000'000.);

It just works.

tick2 - tick1 diff : 2900206596 cycles
CPU Clock Speed    : 2.90 GHz

tick1              : 3ee4e9f13612
this_thread::sleep_for(1s)
tick2              : 3ee596ceda16
tick3              : 3ee596ceda32

Moreover, when you look at the codegen, the compiler has more degrees of freedom to optimize: it can rearrange and even interleave instructions in ways that go beyond what is practical to write by hand.

Clang:

        rdtsc
        mov     r14, rdx
        mov     rcx, rax
        rdtsc
        mov     r15, rdx
        shl     r14, 32
        or      r14, rcx
        shl     r15, 32
        or      r15, rax

GCC:

        rdtsc
        mov     rbx, rax
        sal     rdx, 32
        or      rbx, rdx
        rdtsc
        mov     edi, OFFSET FLAT:.LC1
        mov     r13, rbx
        sal     rdx, 32
        mov     rbp, rax
        xor     eax, eax
        sub     r13, r12
        or      rbp, rdx

You could let the optimizer move things around with inline asm, too, if you define the `asm` statement as producing two separate outputs for the two halves, and do the shift/OR in C. See the example at the bottom of my answer on [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) where using `+` instead of `|` sometimes lets the compiler shift + LEA to combine into a third register in 2 instructions instead of 3 (no `mov`). You can use a `"memory"` clobber to stop it moving relative to non-inline function calls. — Peter Cordes, Aug 15 '23 at 16:46
But I'd still recommend using the intrinsic for simplicity and portability. — Peter Cordes, Aug 15 '23 at 16:46
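A sketch of the two-output form described in the comment above, assuming x86 and GCC/Clang extended asm. `read_tsc_split` is a hypothetical name, and this is not the code from the linked answer. It omits the optional `"memory"` clobber.

```cpp
#include <cstdint>

// Hypothetical helper (x86, GCC/Clang only): the asm statement contains only
// `rdtsc` and exposes both halves as outputs; the combining is left to the
// compiler, which can then schedule the shift/OR wherever it likes.
inline std::uint64_t read_tsc_split() {
    std::uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // EAX -> lo, EDX -> hi
    // The halves do not overlap, so `+` would give the same value as `|`.
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

Because the asm no longer modifies `rdx` behind the compiler's back, no clobber list is needed. The statement still needs `volatile` for the same reason as before.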
