volatile is not a caching thing; the compiler doesn't know anything about caches. It tells the compiler that the value may change between accesses, so to functionally implement our code the compiler has to perform every access we ask for, in the order we ask for them. It can't optimize them out.
void fun1 ( void )
{
/* volatile */ int lock = 999;
while (lock) continue;
}
void fun2 ( void )
{
volatile int lock = 999;
while (lock) continue;
}
volatile int vlock;
int ulock;
void fun3 ( void )
{
while(vlock) continue;
}
void fun4 ( void )
{
while(ulock) continue;
}
void fun5 ( void )
{
vlock=3;
vlock=4;
}
void fun6 ( void )
{
ulock=3;
ulock=4;
}
I find this easier to see in ARM, but the architecture doesn't really matter.
Disassembly of section .text:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
00001004 <fun2>:
1004: e59f3018 ldr r3, [pc, #24] ; 1024 <fun2+0x20>
1008: e24dd008 sub sp, sp, #8
100c: e58d3004 str r3, [sp, #4]
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
101c: e28dd008 add sp, sp, #8
1020: e12fff1e bx lr
1024: 000003e7 andeq r0, r0, r7, ror #7
00001028 <fun3>:
1028: e59f200c ldr r2, [pc, #12] ; 103c <fun3+0x14>
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
103c: 00002000
00001040 <fun4>:
1040: e59f3014 ldr r3, [pc, #20] ; 105c <fun4+0x1c>
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
105c: 00002004
00001060 <fun5>:
1060: e3a01003 mov r1, #3
1064: e3a02004 mov r2, #4
1068: e59f3008 ldr r3, [pc, #8] ; 1078 <fun5+0x18>
106c: e5831000 str r1, [r3]
1070: e5832000 str r2, [r3]
1074: e12fff1e bx lr
1078: 00002000
0000107c <fun6>:
107c: e3a02004 mov r2, #4
1080: e59f3004 ldr r3, [pc, #4] ; 108c <fun6+0x10>
1084: e5832000 str r2, [r3]
1088: e12fff1e bx lr
108c: 00002004
Disassembly of section .bss:
00002000 <vlock>:
2000: 00000000
00002004 <ulock>:
2004: 00000000
First one is the most telling:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
Because this is a local variable that is initialized and non-volatile, the compiler can assume it won't change value between accesses, so it can never change inside the while loop. This is essentially a while(1) loop. If the initial value had been zero, this would be a simple return, since a non-volatile variable that starts at zero can never become non-zero.
In fun2 the variable is still local, so a stack frame needs to be built for it.
The loop does what one assumes the code was trying to do: wait on a shared variable, one that can change during the loop
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
so it samples it and tests what it samples each time through the loop.
fun3 and fun4 are the same deal but more realistic: code external to the function isn't going to change a local lock, so a non-global variable doesn't make much sense for your while loop.
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
For the volatile fun3 case the variable has to be read and tested each time through the loop.
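As a hedged illustration of "something outside the loop changes the variable", here is a minimal sketch using a signal handler. The names busy, on_signal and wait_until_cleared are mine, not from the code above, and raise() is only a stand-in so the sketch terminates deterministically; in real use the signal would arrive asynchronously.

```c
#include <signal.h>

/* Hypothetical flag: set by normal code, cleared by the handler. */
static volatile sig_atomic_t busy = 1;

static void on_signal(int signo)
{
    (void)signo;
    busy = 0;                 /* changes the flag from outside the loop */
}

/* Spins until the flag clears; returns how many times the body ran. */
long wait_until_cleared(void)
{
    long spins = 0;

    busy = 1;
    signal(SIGINT, on_signal);
    while (busy) {            /* volatile: re-read on every pass */
        if (++spins == 1000)
            raise(SIGINT);    /* stand-in for an asynchronous event */
    }
    return spins;
}
```

The loop has the same shape as fun3: the flag lives in memory and is sampled on every pass, so the handler's write is seen.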
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
For the non-volatile global in fun4, the compiler only has to sample the variable once. What it did here with the repeated compare is very interesting, and I'd have to think about why it would do that, but either way you can see that the "loop" retests the value it read into a register (a register, not a cache), which will never change in a proper program. Functionally, by using non-volatile we asked it to read the variable only once, and it then tests that value indefinitely.
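Written back as C, the fun4 output above behaves roughly like the sketch below. This is my reading of the disassembly, not something the compiler emits, and fun4_as_optimized and tmp are made-up names.

```c
int ulock;   /* non-volatile global, as in the original code */

/* Roughly what the optimizer is allowed to turn fun4() into:
   one load, then tests of the register copy only. */
void fun4_as_optimized(void)
{
    int tmp = ulock;      /* the only read of ulock */

    if (tmp == 0)
        return;
    for (;;) {            /* tmp can never change, so this never exits */
    }
}
```

If ulock is zero on entry the function returns; otherwise it spins forever no matter what later happens to ulock.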
fun5 and fun6 further demonstrate that volatile requires the compiler to perform each access to the variable in its storage place before moving on to the next operation/access in the code. So when the variable is volatile we are asking the compiler to perform two assignments, and we get two stores. When it is non-volatile the compiler can optimize out the first store and only do the last one, because if you look at the code as a whole, fun6 simply leaves the variable set to 4.
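One place where both stores matter is a device register where every write is a separate command. The sketch below is only an illustration under that assumption: fake_ctrl_reg is an ordinary variable standing in for a register, and CTRL, send_two_commands and read_ctrl are hypothetical names. On real hardware the address would come from the chip documentation.

```c
#include <stdint.h>

/* Stand-in for a device register (hypothetical; a normal variable here). */
static volatile uint32_t fake_ctrl_reg;
static volatile uint32_t *const CTRL = &fake_ctrl_reg;

/* Both stores must reach the register, in this order, like fun5. */
void send_two_commands(void)
{
    *CTRL = 3;
    *CTRL = 4;
}

/* Reads the register back through the volatile pointer. */
uint32_t read_ctrl(void)
{
    return *CTRL;
}
```

Without the volatile qualifier the compiler would be free to drop the first store, exactly as it did in fun6.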
The x86 output is equally interesting: repz retq is all over it (with the compiler on my computer). It's not hard to find out what that is all about.
None of the aarch64, x86, mips, riscv, msp430, or pdp11 backends do the double check seen in fun4().
pdp11 is actually the easier code to read (no surprise there):
00000000 <_fun1>:
0: 01ff br 0 <_fun1>
00000002 <_fun2>:
2: 65c6 fffe add $-2, sp
6: 15ce 03e7 mov $1747, (sp)
a: 1380 mov (sp), r0
c: 02fe bne a <_fun2+0x8>
e: 65c6 0002 add $2, sp
12: 0087 rts pc
00000014 <_fun3>:
14: 1dc0 0026 mov $3e <_vlock>, r0
18: 02fd bne 14 <_fun3>
1a: 0087 rts pc
0000001c <_fun4>:
1c: 1dc0 001c mov $3c <_ulock>, r0
20: 0bc0 tst r0
22: 02fe bne 20 <_fun4+0x4>
24: 0087 rts pc
00000026 <_fun5>:
26: 15f7 0003 0012 mov $3, $3e <_vlock>
2c: 15f7 0004 000c mov $4, $3e <_vlock>
32: 0087 rts pc
00000034 <_fun6>:
34: 15f7 0004 0002 mov $4, $3c <_ulock>
3a: 0087 rts pc
(This is the unlinked version.)
cmp DWORD PTR [RSP - 8], 0 <--- Why is the comparison done with 0 while DWORD PTR [RSP - 8] holds 999?
while does a true/false comparison, meaning: is the value equal to zero or not equal to zero? That is why the compare is against 0, whatever value the variable currently holds.
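A small sketch of that equivalence, with made-up function names: the two loops below are the same test, which is why the generated code compares against 0 rather than against 999.

```c
/* Implicit test: the loop runs while n is non-zero. */
unsigned count_implicit(unsigned n)
{
    unsigned steps = 0;

    while (n) {
        n--;
        steps++;
    }
    return steps;
}

/* Explicit test: the same loop with the comparison written out. */
unsigned count_explicit(unsigned n)
{
    unsigned steps = 0;

    while (n != 0) {
        n--;
        steps++;
    }
    return steps;
}
```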
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
mov -0x8(%rsp),%eax
cmp 0,%eax
cmp 0,-0x8(%rsp)
as so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: too many memory references for `cmp'
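cmp can't take two memory operands. In AT&T syntax a bare 0 is a memory address, not an immediate (an immediate is written $0), so line 3 above asks for a compare between two memory locations and the assembler rejects it. x86 can compare an immediate against memory in one instruction, which is exactly what the cmp DWORD PTR [RSP - 8], 0 in your first question does. So the copy into EAX is a choice the compiler made when generating that code, not something the instruction set forces; either way the value is read from memory and then compared with zero.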