3

libuv contains the following code in core.c:uv_run():

/* The if statement lets the compiler compile it to a conditional store.
 * Avoids dirtying a cache line.
 */
if (loop->stop_flag != 0)
    loop->stop_flag = 0;

What does this mean? Is it some kind of optimization? Why did they not simply assign 0?

kyb

2 Answers

5

Yes, just like the comment says. If the flag is already 0, there is no need to write any data to memory, which avoids dirtying the cache line and a possible eviction of data already present in the cache. This will provide added value only in extremely time-critical applications.

SomeWittyUsername
  • Is this to avoid "false sharing" ? – Borgleader Dec 09 '16 at 19:17
  • I don't think it's an eviction issue, it's a write back issue (when it's a dirty line). – hesham_EE Dec 09 '16 at 19:18
  • The processor may need to read the variable, so the variable may be in the cache or the processor may have to put the variable in the cache to read it. – Thomas Matthews Dec 09 '16 at 19:19
  • I wonder if writing the same value will actually dirtify it. Probably depends on the hardware. – Eugene Sh. Dec 09 '16 at 19:19
  • Now what is the cost of the branch vs an unconditional write? It would be interesting to see what would be faster. – NathanOliver Dec 09 '16 at 19:25
  • @NathanOliver Well, in conjunction with branch prediction it might be zero cost. – Eugene Sh. Dec 09 '16 at 19:28
  • @EugeneSh. Yep. I'm just curious if that happens in practice or not. If the flag is often `0` or not `0` then I would suspect it would be faster. – NathanOliver Dec 09 '16 at 19:30
  • @NathanOliver Depends on multiple factors, including the frequency of 0 being already in place, the processor architecture (which is directly related to the cost of branching), the type and speed of different cache types, the execution environment (i.e., maybe there is no need to evict anything) and probably more stuff – SomeWittyUsername Dec 10 '16 at 06:15
5

I would argue this optimization is bad. For example, gcc with -O3 gives the following code:

foo():
        movl    stop_flag(%rip), %eax
        testl   %eax, %eax
        je      .L3
        movl    $0, stop_flag(%rip)
.L3:
        ret
stop_flag:
        .zero   4

As you can see, there is no conditional move, but a branch. And I am sure branch misprediction is far worse than dirtying the cache line.

SergeyA
  • Branching is very heavy and worse than assigning values. This is common knowledge among CPU architects. But as @EugeneSh. dramatically pointed out, you must provide a link for OP to see. – Dellowar Dec 09 '16 at 19:32
  • @EugeneSh., I do not have the hard proof, but it seems logical. After all, nothing stops the CPU from not invalidating the cache at all if it sees the value didn't actually change - but misprediction is always a possibility. – SergeyA Dec 09 '16 at 19:32
  • 7
    A write to a memory location can be a lot slower than a branch misprediction. If the core only has shared ownership of the cacheline, it must broadcast an invalidation request to all other cores to invalidate their copies. And if any other core has it in the modified state, it must send back the cacheline and merge it with the current core. Since this involves multiple back-and-forths across the core-interconnect, we're talking latencies greater than a cache-miss (hundreds of cycles) as opposed to 10-ish for a branch misprediction. – Mysticial Dec 09 '16 at 19:36
  • @SergeyA If the branch is taken (not taken) most of the times, the branch predictor would be accurate in most of the cases, so for the large number of iterations the optimization might gain. But of course it depends on the whole picture. – Eugene Sh. Dec 09 '16 at 19:36
  • 2
    That said, I'm unsure if writes that are generated from mispredictions will go all the way through the entire cache coherency path and hit the penalty anyway. I'd expect modern processors are smart enough to suppress such accesses until the instruction is no longer in speculation. – Mysticial Dec 09 '16 at 19:37
  • 1
    @Mysticial Given that the value is *read*, you get the memory access *anyway* unless the line is in cache. Plus, the processor may first request the line as `shared` in the coherence protocol, so you may get *twice* the coherence overhead this way. Add the fact that writes can be buffered without stalling the processor and I'm *really* skeptical about this "optimization". – EOF Dec 09 '16 at 20:30
  • 1
    @EOF I used to think the same thing about writes not mattering. But when I tested it, it revealed otherwise. I suspect two things: 1) The reorder buffer isn't large enough to hide a write miss. Reorder buffers are on the order 200 instructions today. And the core can sustain 2 - 4 inst/cycle. That's not enough to hide an instruction stalled for 200 cycles even if there's nothing depending on it. 2) x86 requires acquire/release semantics for most loads and stores. So stores are committed in program order which basically forces the OOE engine to do all the reordering. – Mysticial Dec 09 '16 at 20:45
  • @Mysticial: I don't think the reorder buffer matters for writes. The *store buffer* may matter if you're storing a lot of things to uncached locations. The memory model should not be too much of a problem, since later reads can pass the writes. If these *do* slow the program down, I'd try making the write non-temporal instead. – EOF Dec 09 '16 at 20:49
  • @EOF Regarding your point about the read already bringing it into cache. It is indeed very close if there's only 1 core running the code in question. If you have multiple cores running, then they will be unnecessarily invalidating each other's copies of the cache line when it suffices to have them all in the shared state. – Mysticial Dec 09 '16 at 20:53
  • 2
    @Mysticial If the value is concurrently modified by another thread/process, the behavior is *undefined* anyway, since `loop->stop_flag` is not atomic. – EOF Dec 09 '16 at 20:58
  • @EOF I just looked at the #'s for Skylake. 224 reorder window, 56 store buffer. But there are no details whether the 224 reorder window *includes* store instructions. Or if store instructions can leave the reorder window before it leaves the store buffer. So that would depend on how many stores the code has. If the code saturates stores 1/cycle, then 56 definitely can't hide 200. – Mysticial Dec 09 '16 at 20:59
  • After thinking about this a bit more, I think you're right. I can only think of one case where it would be beneficial, and I'm not even sure if it applies. If the cache line is already in the shared state, the read-only version is fast. In the always-write case, it would need to broadcast the invalidations and get into the exclusive state. I'm unsure if you need to wait for acks, since some synchronization might be needed to resolve a race where multiple cores (both in shared state) simultaneously write. In all other cases, the bandwidth cost of the dirty line write back is negligible for a flag. – Mysticial Dec 09 '16 at 21:46
  • @SergeyA - "nothing stops CPU from not invalidating the cache at all if it sees the value didn't actually change" - that's not how cache works. The newly written 0 doesn't have to arrive to the exact same location of previous zero in the cache. In fact it almost always won't, because of the vast differences between the sizes of main memory and cache. Additionally, comparing if the value changed would require either a dedicated HW or more CPU usage (which would contradict the motivation to use cache in the first place) – SomeWittyUsername Dec 10 '16 at 06:24