
My CPU is a Core i7 (4 physical cores / 8 logical). I am testing my implementation of a lock-free queue. What is the test? I just create a lot of threads ("pushers" and "poppers") and they push/pop elements. I noticed that it works much faster when... the CPU is loaded. So when the CPU is relatively idle, the test runs slower (relatively much slower), and when the CPU is loaded, it runs faster.

  1. How should I understand this? I think it is caused by the fact that "poppers" and "pushers" have to race for the "head"/"tail" (I mean incrementing the node's counter, because of the memory management). With fewer poppers/pushers there is less contention on that counter. But please note that it really is lock-free (I think so :) ).

  2. Does it mean that in some situations I should send a thread to sleep for 2 ms? Maybe 10 ms (that is a long time for a CPU)? I am not sure.

  • Are you saying that if you run the test on an empty CPU, it runs slower than if you run the exact same test while doing something in the background? Also, source code would be essential for debugging any unusual timing. – Cort Ammon Aug 24 '16 at 22:02
  • Yes, indeed. I can attach the source code, but it is so long that I think it would be discouraging for readers (as in my previous (deleted) post). – Gilgamesz Aug 24 '16 at 22:06
  • How long do you run your test for? Maybe the CPUs are in a power-saving mode when you run your test on an idle system, and you don't run the test for long enough to bring the CPU(s) out of power-saving mode. Link to Google search results on the subject: https://www.google.com/search?q=corei7+power+saving+modes – Andrew Henle Aug 24 '16 at 23:49
  • @AndrewHenle: [this Q&A](http://stackoverflow.com/questions/38299023/why-does-this-delay-loop-start-to-run-faster-after-several-iterations-with-no-sl/38300014#38300014) has more details on power-saving / turbo affecting benchmarks. – Peter Cordes Aug 26 '16 at 01:41

1 Answer


Bouncing cache lines between cores is expensive. It sounds reasonable that one core could push/pop more than 4x faster than 4 cores contending with each other while they try to modify the same memory.
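
Here is a minimal stand-alone sketch of that effect (my own illustration, not the asker's queue code): the same total number of atomic increments on one shared counter, done by one thread vs. split across four contending threads. The contended version is usually slower even though more cores are working, because the cache line holding the counter keeps migrating between cores.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Time `total_ops` relaxed atomic increments, split evenly over `nthreads` threads.
static double run(int nthreads, long total_ops) {
    std::atomic<long> counter{0};
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&counter, ops = total_ops / nthreads] {
            for (long i = 0; i < ops; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const long total = 100'000'000;
    std::printf("1 thread : %.3f s\n", run(1, total));
    std::printf("4 threads: %.3f s\n", run(4, total));  // often slower, despite using 4 cores
}
```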

So the hard part is deciding what changes in total wall-clock time, or in total CPU time summed over all the threads, actually tell you about whether your code is good or not.


To put it another way: You're testing the maximum-contention case, where your threads are spending all their time pushing and popping, and not doing any other work. In real code that used this queue, the other work done by a thread would throttle the access rate to your queue, probably a lot, so threads would step on each other's toes a lot less. (Contention probably causes a significant performance hit with cmpxchg loops, because only one CPU will succeed each time, and all the rest will retry every time.)
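
Here is a rough sketch of what a lower-contention benchmark could look like. `MutexQueue` is only a placeholder so the harness compiles on its own; substitute your lock-free queue (the `push` / `try_pop` names are assumptions about your API, not taken from your code). The `work_per_item` knob simulates the "other work" a real thread would do between queue operations: 0 reproduces the maximum-contention test, larger values throttle the access rate to the queue.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

struct MutexQueue {                      // placeholder only; swap in your lock-free queue
    std::queue<int> q;
    std::mutex m;
    void push(int v) { std::lock_guard<std::mutex> g(m); q.push(v); }
    bool try_pop(int& v) {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        v = q.front(); q.pop(); return true;
    }
};

static void other_work(int iters) {      // burn some cycles outside the queue
    volatile long x = 0;
    for (int i = 0; i < iters; ++i) x += i;
}

static double run(long items, int work_per_item) {
    MutexQueue q;
    auto start = std::chrono::steady_clock::now();
    std::thread producer([&] {
        for (long i = 0; i < items; ++i) { other_work(work_per_item); q.push(int(i)); }
    });
    std::thread consumer([&] {
        int v;
        for (long popped = 0; popped < items; )
            if (q.try_pop(v)) { ++popped; other_work(work_per_item); }
    });
    producer.join(); consumer.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("max contention : %.3f s\n", run(1'000'000, 0));
    std::printf("with other work: %.3f s\n", run(1'000'000, 200));
}
```

The total wall-clock time obviously grows with the extra work; the interesting comparison is how the cost of the queue operations themselves, and the scaling across thread counts, changes as `work_per_item` grows.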

Related: Locks Aren't Slow; Lock Contention Is (Jeff Preshing) makes the same points for testing parallel algorithms that use locks in high vs. low contention cases.


Maybe try benchmarking with some threads doing read-only access

Lock-free algorithms really shine when a lot of the accesses are reads. I guess a queue is normally popped, not just read, so maybe this doesn't make sense for real use. But I bet you'd see different results if some of your threads were just reading the shared queue instead of updating it. (e.g. traverse it from head to tail as a linked list).
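
A read-only "observer" thread could look roughly like this (the `Node` layout and names are assumptions; adapt them to whatever your queue actually uses, and note that in a real concurrent test the reader has to cooperate with the queue's memory-reclamation scheme so nodes aren't freed out from under it):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>

// Hypothetical node layout for the sketch.
struct Node {
    int value = 0;
    std::atomic<Node*> next{nullptr};
};

// Read-only pass from head towards the tail; never writes shared state.
std::size_t reader_pass(const std::atomic<Node*>& head) {
    std::size_t count = 0;
    for (const Node* n = head.load(std::memory_order_acquire); n != nullptr;
         n = n->next.load(std::memory_order_acquire))
        ++count;
    return count;
}

int main() {
    // Single-threaded smoke test so the sketch runs on its own.
    Node a{1}, b{2}, c{3};
    a.next.store(&b);
    b.next.store(&c);
    std::atomic<Node*> head{&a};
    std::printf("nodes seen: %zu\n", reader_pass(head));   // prints 3
}
```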


Another interesting thing to try, in the write code: add a pause instruction ( _mm_pause()) before a load from shared memory somewhere in your benchmark, to avoid memory-ordering mis-speculation. (i.e. where the CPU speculatively uses a value loaded before the load is allowed to become globally visible, and then has to roll back when it turns out the value was changed by another core by the time the load was supposed to become globally visible). Keep in mind that pause sleeps for ~5 cycles on Haswell, but ~100 cycles on Skylake, so even if you see a speedup from it in a non-synthetic benchmark on Haswell, it's probably a bad idea to leave it in for real use on future CPUs.

Note that pause isn't useful before locked read-modify-write instructions; they're already expecting writes from other cores.

Normally you do a relaxed load and then a cmpxchg, so I'm suggesting putting a pause before the load.
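
As a sketch of where that pause would go (this is not the queue's real push, just the usual shape of a CAS retry loop; `tail` and `push_tail` are made-up names):

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause()

// Trivial stand-in node type, just so the sketch compiles on its own.
struct Node { Node* next = nullptr; };

std::atomic<Node*> tail{nullptr};

void push_tail(Node* new_node) {
    for (;;) {
        Node* expected = tail.load(std::memory_order_relaxed);   // plain load first
        if (tail.compare_exchange_weak(expected, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
            return;
        _mm_pause();   // back off briefly before the next load of the contended location
    }
}

int main() {
    Node n;
    push_tail(&n);
}
```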

  • Ok, so we have to accept a performance hit in maximum-contention cases? "Contention probably causes a significant performance hit with cmpxchg loops, because only one CPU will succeed each time, and all the rest will retry every time." I think so too. "to avoid memory-ordering mis-speculation": I cannot understand what you mean here. After all, x86 can only reorder StoreLoad operations, but I use sequential (the default) semantics on my atomics, so the compiler is forced to place fences (like `mfence`). So please explain, or point me to something I can read. (I am going to read the given link later.) – Gilgamesz Aug 25 '16 at 08:03
  • @Gilgamesz: Right, an x86 CPU has to make memory operations become globally visible within the ISA's ordering requirements. But internally it can *speculatively* reorder things for more performance, and detect + roll-back if it would break the rules. Normally this is a win, because loads/stores of heavily-contended shared memory are rare. – Peter Cordes Aug 25 '16 at 08:08
  • You didn't say you used `mo_seq_cst` for your stores. That might increase the cost of contention. – Peter Cordes Aug 25 '16 at 08:09
  • "That might increase the cost of contention.". I know that I have to change this semantic to weaker, but at that moment I have other problems :). " internally it can speculatively reorder things for more performance". What does it mean- speculatively reorder? Afer all, every store ( even speculative) must be globally visible so it can be a problem ( misspeculation). I can imagine how works speculativly executed a branch. – Gilgamesz Aug 25 '16 at 08:19
  • @Gilgamesz: Right, you can't speculatively store, but you can speculatively load. If it turns out the value you fed into the out-of-order engine has changed by the time the load should be globally visible, you have to roll back when you detect the mis-speculation. – Peter Cordes Aug 25 '16 at 08:22
  • Thanks @PeterCordes: now it is clearer. It is interesting how the CPU knows that the mis-speculated value was changed and that it is necessary to flush the pipeline. Can you point me to somewhere I can read about it (if you know of such literature, of course)? – Gilgamesz Aug 25 '16 at 08:38
  • @Gilgamesz: I've had a hard time googling for good links when I mention it in answers. e.g. I found https://en.wikipedia.org/wiki/Memory_dependence_prediction, but that's about speculating about store addresses within a single core. The other hits for "memory order mis-speculation" were mostly books. But I've definitely seen it said that one of the major benefits of `pause` is that it tells the CPU not to do this speculation, so you don't have a mis-speculation just as you're breaking out of a spin-loop. There's a perf counter for it. – Peter Cordes Aug 25 '16 at 08:44
  • IIRC, I included a link about the relevant perf counter in a paragraph about False Sharing in [this answer](http://stackoverflow.com/questions/37361145/deoptimizing-a-program-for-the-pipeline-in-intel-sandybridge-family-cpus/37362225#37362225). Yup: http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/pentium4_hh/er4/memory_order_machine_clear_performance_impact.htm. Of course, that's for P4, but SnB-family has a similar counter. – Peter Cordes Aug 25 '16 at 08:45