
If a variable is not declared with the keyword volatile, the compiler is likely to cache its value in a register. Otherwise (with volatile), the variable must always be accessed from memory. The part I don't understand is the generated assembly.

int main() {
    /* volatile */ int lock = 999;
    while (lock);
}

With the x86-64 clang 3.0.0 compiler, the assembly is as follows.

main:                                   # @main
        mov     DWORD PTR [RSP - 4], 0
        mov     DWORD PTR [RSP - 8], 999


.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        cmp     DWORD PTR [RSP - 8], 0
        je      .LBB0_3
        jmp     .LBB0_1


.LBB0_3:
        mov     EAX, DWORD PTR [RSP - 4]
        ret

With the volatile keyword uncommented, the output becomes the following.

main:                                   # @main
        mov     DWORD PTR [RSP - 4], 0
        mov     DWORD PTR [RSP - 8], 999


.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     EAX, DWORD PTR [RSP - 8]
        cmp     EAX, 0
        je      .LBB0_3
        jmp     .LBB0_1


.LBB0_3:
        mov     EAX, DWORD PTR [RSP - 4]
        ret

The points I don't understand:

  • cmp DWORD PTR [RSP - 8], 0 <--- Why is the comparison done with 0 when DWORD PTR [RSP - 8] holds 999?
  • Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
  • `while (lock)` doesn't care what the value of `lock` is. It only cares whether `lock` is 0 or not 0. – user3386109 Mar 13 '19 at 06:17
  • Compilers don't cache, and generally have no clue about caches. It's not a matter of caching; it is a matter of whether the compiler is requested to perform a memory cycle to load or store the variable from/to memory before moving to the next operation. – old_timer Mar 13 '19 at 06:45
  • What other value would you use to compare against in the condition? You only use `while (lock)`, not `while (lock == 999)`. – Gerhardh Mar 13 '19 at 07:16
  • @old_timer: Compilers do cache. “Cache” is not just a name for certain memory; it is a common English word meaning to store for future use. Compilers cache when they hold a value in a register for repeated use instead of reloading it from memory each time source code refers to it. That is the sense in which the OP used the word. – Eric Postpischil Mar 13 '19 at 11:09
  • There is no point in reasoning about unoptimised code. Always turn on optimisations before questioning what code the compiler generated. – fuz Mar 13 '19 at 11:21

2 Answers

It looks like you forgot to enable optimization. -O0 treats all variables (except register variables) pretty similarly to volatile for consistent debugging.

With optimization enabled, compilers can hoist non-volatile loads out of loops. while(locked); will compile similarly to source like

if (locked) {
    while(1){}
}

Or since locked has a compile-time-constant initializer, the whole function should compile to jmp main (an infinite loop).

See MCU programming - C++ O2 optimization breaks while loop for more details.


Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?

Some compilers are worse at folding loads into memory operands for other instructions when you use volatile. I think that's why you're getting a separate mov load here; it's just a missed optimization.

(Although cmp [mem], imm might be less efficient. I forget if it can macro-fuse with a JCC or something. With a RIP-relative addressing mode it couldn't micro-fuse the load, but a register base is ok.)


cmp EAX, 0 is weird; I guess clang with optimization disabled doesn't look for test eax,eax as a peephole optimization for comparing against zero.

As @user3386109 commented, locked in a boolean context is equivalent to locked != 0 in C / C++.
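To make that equivalence concrete, here is a minimal sketch. The helper names implicit_test and explicit_test are hypothetical and exist only for illustration; the point is that the condition written as a bare `lock` and the one written as `lock != 0` give the same answer for any int.

```c
#include <stdbool.h>

/* Hypothetical helpers, named here only for illustration. */

/* Mirrors the implicit test performed by `while (lock)`. */
bool implicit_test(int lock)
{
    return lock ? true : false;
}

/* The explicit spelling of the same condition. */
bool explicit_test(int lock)
{
    return lock != 0;
}
```

Neither function cares whether the value is 999, -1 or anything else; only zero versus non-zero matters, which is why the generated code compares against 0.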

Peter Cordes

The compiler doesn't know about caching; this is not a caching thing. volatile tells the compiler that the value may change between accesses, so to implement our code faithfully it needs to perform the accesses we ask for, in the order we ask for them. It can't optimize them out.

void fun1 ( void )
{
    /* volatile */ int lock = 999;
    while (lock) continue;
}
void fun2 ( void )
{
    volatile int lock = 999;
    while (lock) continue;
}
volatile int vlock;
int ulock;
void fun3 ( void )
{
    while(vlock) continue;
}
void fun4 ( void )
{
    while(ulock) continue;
}
void fun5 ( void )
{
    vlock=3;
    vlock=4;
}
void fun6 ( void )
{
    ulock=3;
    ulock=4;
}

I find it easier to see on ARM, though the architecture doesn't really matter.

Disassembly of section .text:

00001000 <fun1>:
    1000:   eafffffe    b   1000 <fun1>

00001004 <fun2>:
    1004:   e59f3018    ldr r3, [pc, #24]   ; 1024 <fun2+0x20>
    1008:   e24dd008    sub sp, sp, #8
    100c:   e58d3004    str r3, [sp, #4]
    1010:   e59d3004    ldr r3, [sp, #4]
    1014:   e3530000    cmp r3, #0
    1018:   1afffffc    bne 1010 <fun2+0xc>
    101c:   e28dd008    add sp, sp, #8
    1020:   e12fff1e    bx  lr
    1024:   000003e7    andeq   r0, r0, r7, ror #7

00001028 <fun3>:
    1028:   e59f200c    ldr r2, [pc, #12]   ; 103c <fun3+0x14>
    102c:   e5923000    ldr r3, [r2]
    1030:   e3530000    cmp r3, #0
    1034:   012fff1e    bxeq    lr
    1038:   eafffffb    b   102c <fun3+0x4>
    103c:   00002000    

00001040 <fun4>:
    1040:   e59f3014    ldr r3, [pc, #20]   ; 105c <fun4+0x1c>
    1044:   e5933000    ldr r3, [r3]
    1048:   e3530000    cmp r3, #0
    104c:   012fff1e    bxeq    lr
    1050:   e3530000    cmp r3, #0
    1054:   012fff1e    bxeq    lr
    1058:   eafffffa    b   1048 <fun4+0x8>
    105c:   00002004    

00001060 <fun5>:
    1060:   e3a01003    mov r1, #3
    1064:   e3a02004    mov r2, #4
    1068:   e59f3008    ldr r3, [pc, #8]    ; 1078 <fun5+0x18>
    106c:   e5831000    str r1, [r3]
    1070:   e5832000    str r2, [r3]
    1074:   e12fff1e    bx  lr
    1078:   00002000    

0000107c <fun6>:
    107c:   e3a02004    mov r2, #4
    1080:   e59f3004    ldr r3, [pc, #4]    ; 108c <fun6+0x10>
    1084:   e5832000    str r2, [r3]
    1088:   e12fff1e    bx  lr
    108c:   00002004    

Disassembly of section .bss:

00002000 <vlock>:
    2000:   00000000    

00002004 <ulock>:
    2004:   00000000    

First one is the most telling:

00001000 <fun1>:
    1000:   eafffffe    b   1000 <fun1>

Because lock is a local variable that is initialized and not volatile, the compiler can assume its value won't change between accesses. It can never change inside the while loop, so this is essentially a while (1) loop. Had the initial value been zero, the function would be a simple return, since a non-volatile variable that starts at zero can never become non-zero.
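The zero-initialized case can be sketched like this. fun1_zero is a hypothetical variant of fun1, not part of the listing above, and the iteration counter is added only so the result is observable.

```c
/* Hypothetical variant of fun1 with a zero initializer. */
int fun1_zero(void)
{
    int lock = 0;           /* non-volatile, so nothing can change it */
    int iterations = 0;

    while (lock)
        ++iterations;       /* never executed */

    return iterations;
}
```

With optimization enabled one would expect this to collapse to little more than a return of 0, mirroring how fun1 collapsed to a single branch.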

In fun2 the variable is local but volatile, so a stack frame has to be built to give it a memory location.

It does what one assumes the code was trying to do: wait for a shared variable, one that can change during the loop

    1010:   e59d3004    ldr r3, [sp, #4]
    1014:   e3530000    cmp r3, #0
    1018:   1afffffc    bne 1010 <fun2+0xc>

so it samples the variable and tests the sampled value each time through the loop.
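As a sketch of a situation where that per-iteration sampling matters, the following uses a signal handler to change the flag behind the loop's back. The names keep_waiting, on_signal and spin_until_signalled are assumptions made for this example; signal, raise and volatile sig_atomic_t are standard C.

```c
#include <signal.h>

/* Hypothetical flag name; volatile sig_atomic_t is the type the C
   standard guarantees can be written from a signal handler. */
static volatile sig_atomic_t keep_waiting = 1;

static void on_signal(int signo)
{
    (void)signo;
    keep_waiting = 0;       /* changes the flag outside the loop's control flow */
}

/* Spins until the handler clears the flag; returns the iteration count. */
int spin_until_signalled(void)
{
    int iterations = 0;

    keep_waiting = 1;
    signal(SIGINT, on_signal);

    while (keep_waiting) {  /* re-read on every pass because it is volatile */
        ++iterations;
        if (iterations == 3)
            raise(SIGINT);  /* the handler runs before raise() returns */
    }

    signal(SIGINT, SIG_DFL);
    return iterations;
}
```

The loop body never assigns keep_waiting, yet the loop ends on the third pass because each test of the condition performs a fresh load.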

fun3 and fun4 are the same deal but more realistic. Code outside the function isn't going to change a local lock, so a non-global variable doesn't make much sense for your while loop.

102c:   e5923000    ldr r3, [r2]
1030:   e3530000    cmp r3, #0
1034:   012fff1e    bxeq    lr
1038:   eafffffb    b   102c <fun3+0x4>

For the volatile fun3 case the variable has to be read and tested on every pass through the loop.

1044:   e5933000    ldr r3, [r3]
1048:   e3530000    cmp r3, #0
104c:   012fff1e    bxeq    lr
1050:   e3530000    cmp r3, #0
1054:   012fff1e    bxeq    lr
1058:   eafffffa    b   1048 <fun4+0x8>

For the non-volatile global, the compiler only has to sample the variable once. What the compiler did here is very interesting, and I'd have to think about why it tests twice, but either way you can see that the "loop" retests the value held in a register (not cached), which will never change in a proper program. Functionally, by using non-volatile we asked it to read the variable only once, and it then tests that value indefinitely.
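Written back as C, the transformation the optimizer is allowed to make on fun4 looks roughly like the sketch below. fun4_hoisted and the local sampled are hypothetical names for illustration; this is a hand-written picture of the idea, not compiler output.

```c
int ulock;   /* same non-volatile global as in the listing above */

/* Hypothetical name: what the optimizer may turn fun4 into. */
void fun4_hoisted(void)
{
    int sampled = ulock;    /* the only load from memory */

    while (sampled)         /* never sees a later change to ulock */
        continue;
}
```

If ulock is zero at the call, the function returns at once; if it is non-zero, no later store to ulock can end the loop, which matches the ARM code that keeps retesting r3.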

fun5 and fun6 further demonstrate that volatile requires the compiler to perform each access to the variable in its storage location before moving on to the next operation or access in the code. With volatile we are asking the compiler to perform two assignments, so it emits two stores. Without volatile the compiler can optimize out the first store and keep only the last one: looking at the function as a whole, fun6 leaves the variable set to 4.
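A typical reason to want both stores is a hardware register, where each write has an effect of its own. The sketch below assumes a stand-in variable, fake_reg, in place of a real device address, and the function names are hypothetical.

```c
/* Hypothetical stand-in for a device register; on real hardware this
   would be a fixed address taken from the chip's documentation. */
volatile unsigned int fake_reg;

/* Both stores must be emitted, in this order, because the target is volatile. */
void pulse(volatile unsigned int *reg)
{
    *reg = 1u;
    *reg = 0u;
}

/* Convenience wrapper around the stand-in register. */
void pulse_fake_reg(void)
{
    pulse(&fake_reg);
}

/* Reads the register back; each call performs a real load. */
unsigned int read_fake_reg(void)
{
    return fake_reg;
}
```

Without the volatile qualifier, a compiler would be free to drop the store of 1 exactly as it dropped the store of 3 in fun6.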

The x86 output is equally interesting: repz retq is all over it (with the compiler on my computer). It is not hard to find out what that is all about.

None of the aarch64, x86, mips, riscv, msp430 or pdp11 backends do the double check seen in fun4().

pdp11 is actually the easier code to read (no surprise there)

00000000 <_fun1>:
   0:   01ff            br  0 <_fun1>

00000002 <_fun2>:
   2:   65c6 fffe       add $-2, sp
   6:   15ce 03e7       mov $1747, (sp)
   a:   1380            mov (sp), r0
   c:   02fe            bne a <_fun2+0x8>
   e:   65c6 0002       add $2, sp
  12:   0087            rts pc

00000014 <_fun3>:
  14:   1dc0 0026       mov $3e <_vlock>, r0
  18:   02fd            bne 14 <_fun3>
  1a:   0087            rts pc

0000001c <_fun4>:
  1c:   1dc0 001c       mov $3c <_ulock>, r0
  20:   0bc0            tst r0
  22:   02fe            bne 20 <_fun4+0x4>
  24:   0087            rts pc

00000026 <_fun5>:
  26:   15f7 0003 0012  mov $3, $3e <_vlock>
  2c:   15f7 0004 000c  mov $4, $3e <_vlock>
  32:   0087            rts pc

00000034 <_fun6>:
  34:   15f7 0004 0002  mov $4, $3c <_ulock>
  3a:   0087            rts pc

(This is the unlinked version.)


cmp DWORD PTR [RSP - 8], 0 <--- Why is the comparison done with 0 when DWORD PTR [RSP - 8] holds 999?

while does a true/false comparison, meaning it tests whether the value is equal to zero or not equal to zero.

Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?

mov -0x8(%rsp),%eax
cmp 0,%eax
cmp 0,-0x8(%rsp)

as so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: too many memory references for `cmp'

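That test is misleading, though. In AT&T syntax an immediate needs a $ prefix; without it, 0 is read as a memory address, so the third line asks for a compare between two memory operands, and that is what the assembler rejects. x86 can compare an immediate against memory in a single instruction, as the non-volatile output in the question shows (cmp DWORD PTR [RSP - 8], 0), and as prl points out in the comment below. The separate load into EAX in the volatile build is a choice the compiler made with optimization disabled, not something the instruction set requires.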

halfer
old_timer
  • The last paragraph is wrong. x86 does have an instruction to compare immediate to memory. The reason the assembler gave an error is that an immediate needs a $ prefix. – prl Mar 13 '19 at 17:07