C optimization: conditional store to avoid dirtying a cache line

Question

In the libuv source, I found this code:

  /* The if statement lets the compiler compile it to a conditional store.
   * Avoids dirtying a cache line.
   */
  if (loop->stop_flag != 0)
    loop->stop_flag = 0;

Can someone explain this a bit?

What exactly is a cache line?

Also, I guess a conditional store is some Assembler instruction which checks something and if succeeded, writes some value. Right?

When does such construct makes sense? I guess not always, because otherwise the compiler would just always use the conditional store, right?

Gene · Accepted Answer · 2014-03-28T16:39:09.613

9

A cache is organized in blocks of fast memory that, for historical reasons, are called lines. When you write to a cache line, it is marked as "dirty," meaning a bit is set in the cache controller hardware signifying that the line needs to be copied to other levels of cache and/or main memory before some other part of the system can access it.

In general, each level of the memory hierarchy: registers, L1, L2, L3 ... cache, main memory, and swap space has different copies of the same information. Making sure that different parts of the system (processors, DMA, video subsystem, etc.) see the same value even though one or more copies may have changed is called the coherency problem.

The general solution is pausing to copy an updated value to different levels of the hierarchy. This is called a flush.

A flush can cost 10's to - in the worst case when it causes a page fault - perhaps millions of processor cycles.

Due to this high cost, hardware designers go to great lengths to minimize the need for flushes. Here the programmer has taken up the cause, too.

The comment is saying "if the cache already contains a zero in the flag, let's not write a zero over the zero because this will mark the cache line dirty, which may cause an unnecessary flush."

"Conditional store" is a slightly obscure term. It just refers to a jump on zero around the normal store, which is the code a compiler will produce from the if statement. In X86 it will look something like:

    ;; assume edi holds value of pointer 'loop'
    ;; and flag is a constant offset for the 'stop_flag' field.
    cmp dword ptr [edi, flag], 0
    jz no_store
    mov [edi, flag], 0
no_store:
   ... code continues

With the if statement missing, you'd have only the last mov instruction.

NB A commenter pointed out that there do exist single "conditional move/store" instructions on important processor architectures. I have not seen gcc produce one.

Whether this is a worthwhile optimization is very debatable. Conditionals have their own risks of flushing the instruction pipeline (a different kind of flush). Never sacrifice clarity for speed without clear evidence that it's needed.

edited Mar 28 '14 at 16:39

answered Mar 28 '14 at 15:36

Gene

46,253
4
58
96

So, about my last question: This sounds like if the address of the value is in the main memory, it's almost always better to use such a conditional store. Why doesn't the compiler do that automatically? – Albert Mar 28 '14 at 15:43
Because it only is an optimization, if you can guarantee the value is in (unmodified) cache. Otherwise it's a pessimization, at worst forcing a main memory access. – Deduplicator Mar 28 '14 at 15:51
1

Compiler optimise if it's always right. testing then allocating need one more test, and a conditional jump. It will cost 2 (at least) CPU cycle,and one of the precious pipeline guess. Compiler can't guess if it's worth the speed of the cache. – ColdCat Mar 28 '14 at 15:52
BTW: A flush can never cause a page fault. That happened earlier. THis optimization really comes into it's own with multiprocessing: Cache-ping-pong kills performance like you won't believe. – Deduplicator Mar 28 '14 at 15:52
@Deduplicator I don't get that. It can if the cache architecture allows cache lines to persist for swapped out pages. In that case a flush can only be completed if the page image is first swapped in. – Gene Mar 28 '14 at 16:07
That would be a basic violation of the cache hierarchy. Can you name any system allowing such? If cache-coherency works, your io-system reads the whole page on swapout, and gets to own the page on swapin. Anything else leads to mass breakage. – Deduplicator Mar 28 '14 at 16:13
2

Regarding conditional store: That's an instruction of many architectures, not a couple of instructions in sequence. Read your assembler manual. Special attention to `CMOVNZ` please. – Deduplicator Mar 28 '14 at 16:17
Another follow-up question: Wouldn't the compiler maybe just optimize that `if`-check away? Because it could conclude that it has no effect and just always write the value. – Albert Mar 28 '14 at 16:25
That's a quality-of-implementation issue, and cannot be answered in general. It's a good one though. – Deduplicator Mar 28 '14 at 16:27
1

@Albert It could if there were code above that actually guarantees that the value is zero, but in general there won't be. More interesting would be the possibility for the cache hardware to detect that the write does not change the value, so marking dirty is unnecessary. – Gene Mar 28 '14 at 16:28
@Deduplicator On cache hierarchy, I said it was a worst case. A VM-side cache allows this without a "basic violation of the cache hierarchy". I agree this is a research topic, not common practice. On CMOVNZ, my gcc does not generate the instruction at -O4. I used the sequence it generated. – Gene Mar 28 '14 at 16:36
Only a software cache, but that's a completely different layer. If I'm wrong, please enlighten me and point me to an example. – Deduplicator Mar 28 '14 at 16:40
2

@Deduplicator CMOVCC is only a memory to register instruction so it's not applicable in this case. – ColdCat Mar 28 '14 at 17:30
@ColdCat: Now I'm curious: How do you deduce that? Without: test+jmp+move With: test+CMOVEcc – Deduplicator Mar 28 '14 at 17:35
1

@Deduplicator cmovcc result is always in a register. So you have to do a mov to write it to memory. So if you want to optimise write cache it's useless. – ColdCat Mar 28 '14 at 17:55

Deduplicator · Answer 2 · 2014-03-28T21:04:28.937

"to cache" means to hide something. The function of a Cache in computing is to hide the distance to main memory, by preempting main memory accesses as much as possible.

That only works if you used the data before, you have not pushed it out of cache yet, and nobody else took it away before you. Any other actor (other CPU, IO-Bus, ...) must be able to get the current value and change it, even if you have it cached. This task is accomplished using cache coherency protocols. Higher coherency means higher cost.

What your code tries to do, is make the compiler emit a conditional move, so the CPU checks for 0, and only writes if not 0. There's a whole family of conditional move instructions in the Intel/AMD IS and many others.

So now, step for step:

Test for 0: If your CPU does not have a copy of the tested data, it must ask for one. That's far worse than before. Let's hope you don't hit main memory.
Prepare to write a value:
- You own the data: Wonderful, you are already done.
- You don't own the data: The Cache calls its brethren and higher layers to inform them that it now owns this piece. Nobody else can retain a copy.
Write a value: The cache saves the change and marks the cacheline (Lowest granularity of cache) dirty, need to write back.

So, is it worth it? It depends.

Aside: Why provide conditional store instruction, if you can synthesise them using a conditional jump and a store? The advantages are using less instructions and no risk of flushing the instruction pipeline (partly executed following instructions). UPDATE: Looks like they cannot move from register/immediate to memory on x86/x86_64.

Only a few ISAs can do a conditional store without branching. For example ARM can predicate any instruction. As you noted in your update, [x86 `cmov reg, reg/mem`](https://www.felixcloutier.com/x86/cmovcc) can only write to a reg, and it's an unconditional load to feed an ALU select operation if you use it with a memory source. It also doesn't allow an immediate source operand, unfortunately. See also [How does x86 handle store conditional instructions?](https://stackoverflow.com/a/75430225) . (But here you would want to branch instead of playing with non-`lock`ed `cmpxchg` or `rep stosd` — Peter Cordes, Feb 12 '23 at 21:33
@PeterCordes In the specific example silent store optimization (in hardware) could provide the same effect. (I do not know what should happen for a silent to a MONITORed or transactional memory location.) — , Feb 13 '23 at 18:45

C optimization: conditional store to avoid dirtying a cache line

2 Answers2

Linked

Related