12

I am running a while loop in 4 threads; inside the loop I evaluate a function and increment a counter.

while(1) {
    int fitness = EnergyFunction::evaluate(sequence);

    mutex.lock();      // protect the shared counter
    counter++;
    mutex.unlock();
}

When I run this loop, in the 4 threads mentioned above, I get ~ 20 000 000 evaluations per second.

while(1) {
    // random mutation: set one randomly chosen element to -1 or 1
    if (dist(mt) == 0) {
        sequence[distDim(mt)] = -1;
    } else {
        sequence[distDim(mt)] = 1;
    }
    int fitness = EnergyFunction::evaluate(sequence);

    mainMTX.lock();
    overallGeneration++;
    mainMTX.unlock();
}

If I add this random mutation of the sequence, I get ~ 13 000 000 evaluations per second.

while(1) {
    if (dist(mt) == 0) {
        sequence[distDim(mt)] = -1;
    } else {
        sequence[distDim(mt)] = 1;
    }
    int fitness = EnergyFunction::evaluate(sequence);

    mainMTX.lock();
    if(fitness < overallFitness)   // keep the best (lowest) fitness seen so far
        overallFitness = fitness;

    overallGeneration++;
    mainMTX.unlock();
}

But when I add a simple if statement that checks whether the new fitness is smaller than the old fitness and, if so, replaces the old one, the performance loss is massive! Now I get ~ 20 000 evaluations per second. Even if I remove the random mutation part, I still get ~ 20 000 evaluations per second.

The variable overallFitness is declared as

extern int overallFitness; 

I am having trouble figuring out what causes such a big performance loss. Is comparing two ints really such an expensive operation?

Also, I don't believe it is related to mutex locking.

UPDATE

This performance loss was not caused by branch misprediction; the compiler simply ignored the call int fitness = EnergyFunction::evaluate(sequence); because its result was unused.

Now that I have added volatile, the compiler no longer ignores the call.

Also, thank you for pointing out branch misprediction and atomic&lt;int&gt;; I didn't know about them!

Because of atomic I also removed the mutex, so the final code looks like this:

while(1) {
    sequence[distDim(mt)] = lookup_Table[dist(mt)];  // branchless mutation
    fitness = EnergyFunction::evaluate(sequence);    // fitness is volatile now
    if(fitness < overallFitness)
       overallFitness = fitness;
    ++overallGeneration;                             // atomic, no mutex needed
}
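For reference, the shared declarations this final version implies would look roughly like this (a sketch: the exact types are my assumption, since the declarations are not shown above):

#include <atomic>
#include <climits>

volatile int fitness;                      // volatile: the result is now "used",
                                           // so evaluate() cannot be optimized away
std::atomic<int> overallFitness{INT_MAX};  // best fitness shared across threads
std::atomic<long> overallGeneration{0};    // shared counter, replaces the mutex
// note: the compare-then-store on overallFitness in the loop is still
// not one atomic step; two threads can race between the read and the write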

Now I am getting ~ 25 000 evaluations per second.

TomazStoiljkovic
  • One possible reason [is branch misprediction](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) – R_Kapp Oct 29 '15 at 13:14
  • @R_Kapp Nah, that's unrelated. This here is simply lock contention. – Konrad Rudolph Oct 29 '15 at 13:14
  • `atomic` may help. – Jarod42 Oct 29 '15 at 13:15
  • @KonradRudolph: Fair enough; I don't know enough about mutex locking, so I took the OP's word for it at "I don't believe that is related to mutex locking" – R_Kapp Oct 29 '15 at 13:15
  • Actually writing `.lock()` and `.unlock()` is an anti-pattern; use a `lock_guard`. – Barry Oct 29 '15 at 13:16
  • @R_Kapp Actually it might be somewhat related. The main problem here is of course the cross-thread access pattern: not only reading but actually *writing* a global variable and pushing updates to all caches is simply a performance killer. But the fact that this is much slower than the unconditional update is probably related to how instructions can or cannot be cached depending on that value. – Konrad Rudolph Oct 29 '15 at 13:16
  • If the problem is rooted in writing a global variable (as KonradRudolph mentioned, and is very plausible), you could remedy it by storing the *local* best fitness in the loop, and only after the loop updating the global fitness if the local one is better. – king_nak Oct 29 '15 at 13:44
  • As a general suggestion, the shorter your test case is, the easier it is for others to help you out. The "random mutation part" is a red herring, as you even show. You should just delete that part of your original question. – Mark Lakata Oct 29 '15 at 16:21
  • I suspect you may be misunderstanding your own results. You talk about how many "evaluations per second" you're getting, but what you're really measuring is how many mutex acquire/release cycles per second you're getting. – David Schwartz Oct 29 '15 at 17:02
  • The design is probably faulty as well. You seem to be looking for a best value across threads. This is correctly done by calculating a per-thread best, and only when each thread is finished comparing that per-thread best to the global one. – MSalters Oct 29 '15 at 20:04
  • @KonradRudolph: if this were a case of speculative execution gone wrong, as R_Kapp suggested, would the penalty be that big? – Mister Smith Oct 29 '15 at 21:51
  • @MisterSmith Probably not, but it's *really* hard to say. However, as noted in several answers, what the first code is actually measuring is simply the cost of a lock, since the function call is optimised out. And *that's* why it's so much faster. – Konrad Rudolph Oct 29 '15 at 22:02
  • Is there any change if you replace your fitness/overallFitness compare/assign with an InterlockedCompareExchange()? – Eric Towers Oct 30 '15 at 05:38
  • This is not a duplicate of the suggested [branch misprediction question](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array?lq=1). Mark [nailed the reason](http://stackoverflow.com/a/33416731/923794). – cfi Oct 30 '15 at 07:31

4 Answers

31

You need to run a profiler to get to the bottom of this. On Linux, use perf.

My guess is that EnergyFunction::evaluate() is being entirely optimized away: in the first examples you don't use the result, so the compiler can discard the whole call. You can try writing the return value to a volatile variable, which should force the compiler (or linker) not to optimize the call away. A 1000x speed-up is definitely not attributable to a simple comparison.
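A minimal sketch of that trick (the name sink is illustrative, not from the original code):

volatile int sink;   // writes to a volatile object are observable side effects,
                     // so the compiler must keep the call that produces them

while (1) {
    sink = EnergyFunction::evaluate(sequence);
    // ... rest of the loop unchanged ...
}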

Mark Lakata
  • This is probably the reason; gcc without O1 or higher just optimizes that function call away. – fireant Oct 29 '15 at 14:32
  • @Peter Sorry Peter, I just missed your answer :) I will add you to the credits in my answer too. Good catch anyway! – Alex Lop. Oct 29 '15 at 16:53
11

There is actually an atomic instruction to increment an int by 1, so a smart compiler may be able to remove the mutex entirely, although I'd be surprised if it did. You can test this by looking at the assembly, or by removing the mutex, changing the type of overallGeneration to atomic&lt;int&gt;, and checking how fast it still is. This optimization is no longer possible with your last, slow example.
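A sketch of that experiment (the memory ordering is my choice here; relaxed ordering suffices for a pure counter):

#include <atomic>

std::atomic<int> overallGeneration{0};

// In the loop body, instead of mainMTX.lock(); overallGeneration++; mainMTX.unlock();
overallGeneration.fetch_add(1, std::memory_order_relaxed);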

Also, if the compiler can see that evaluate does nothing to the global state and the result isn't used, it can skip the entire call to evaluate. You can find out whether that's the case by looking at the assembly, or by removing the call to EnergyFunction::evaluate(sequence) and looking at the timing: if it doesn't speed up, the function wasn't being called in the first place. This optimization is no longer possible with your last, slow example. You should be able to force the compiler to execute EnergyFunction::evaluate(sequence) by defining the function in a different object file (another cpp file or a library) and disabling link-time optimization.
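A sketch of that setup (the file name, the function signature, and the placeholder body are all assumptions; the original declaration is not shown in the question):

// energy.cpp -- a separate translation unit. Without link-time optimization
// the optimizer cannot see this body at the call site and must emit the call.
// (In practice the class declaration would live in a shared header.)
#include <vector>

struct EnergyFunction {
    static int evaluate(const std::vector<int>& sequence);
};

int EnergyFunction::evaluate(const std::vector<int>& sequence) {
    int sum = 0;                        // placeholder body for the sketch
    for (int v : sequence) sum += v;
    return sum;
}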

There are other effects here that also create a performance difference, but I can't see any other effect that could explain a difference of a factor of 1000. A factor of 1000 usually means the compiler cheated in the previous test, and the change now prevents it from cheating.

Peter
  • Atomic instructions do not help here (at least not fundamentally), because updating the variable *still* requires invalidating the local core caches and broadcasting the value to all cores. And that's still inefficient (if I remember correctly, roughly a factor of 1000 slower than a local variable read). – Konrad Rudolph Oct 29 '15 at 16:25
  • @KonradRudolph The atomic increment instruction is potentially orders of magnitude cheaper than waiting on an already locked mutex, which is probably an atomic compare-exchange followed by a kernel call; it will differ based on the implementation of the mutex. The problem is more that compilers are very careful about optimizing around memory barriers, so they are very unlikely to touch the mutex code. Therefore the second option I suggested is the more likely cause. – Peter Oct 29 '15 at 16:46
  • All that is correct, but the fact that this code has this dependency between the threads at all is a much bigger problem, in my opinion, and should be solved algorithmically. Meaning, you can potentially save much more by restructuring the code than by just getting rid of the single mutex in favour of atomic operations. You seem to know what you're doing, but many people expect atomic operations to perform magic, when in reality they are more on par with a very cheap mutex. They don't actually fix wonky compute dependencies. – Konrad Rudolph Oct 29 '15 at 16:48
  • @KonradRudolph You're on the question of how to optimize the code; the question was why the performance changed. I guess we agree that a mutex that isn't contended should cost about 2 atomic operations to lock and unlock, a mutex that is contended costs much more, and a proper design shouldn't have contended mutexes. – Peter Oct 29 '15 at 16:57
7

I am not sure that my answer fully explains such a dramatic performance drop, but the effects below definitely have an impact on it.

In the first case you added branches to the non-critical area:

if (dist(mt) == 0) {
    sequence[distDim(mt)] = -1;
} else {
    sequence[distDim(mt)] = 1;
}

In this case the CPU (at least on Intel architectures) will perform branch prediction, and in case of a branch misprediction there is a performance penalty; this is a known fact.
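Incidentally, this is the branch the OP's final code removes by indexing a small lookup table instead. A sketch of that rewrite (the table's initialization is my assumption, matching the -1/1 values above):

static const int lookup_Table[2] = { -1, 1 };  // assumed contents

// Branchless mutation: no branch, hence nothing to mispredict.
sequence[distDim(mt)] = lookup_Table[dist(mt)];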

Now regarding the second addition, you added a branch to the critical area:

mainMTX.lock();
if(fitness < overallFitness)
    overallFitness = fitness;

overallGeneration++;
mainMTX.unlock();

This, in turn, in addition to the misprediction penalty, increases the amount of code executed in that area and thus the probability that other threads will have to wait for mainMTX.unlock();.
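As an aside, and as Barry points out in the comments on the question, the manual lock()/unlock() pair is more robustly written with a scoped guard; a sketch:

#include <mutex>

{
    std::lock_guard<std::mutex> guard(mainMTX);  // unlocked automatically at
                                                 // scope exit, even on exceptions
    if (fitness < overallFitness)
        overallFitness = fitness;
    ++overallGeneration;
}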

NOTE

Please make sure that all the global/shared resources are declared volatile. Otherwise the compiler may optimize accesses to them away (which might explain the very high number of evaluations at the beginning).

In the case of overallFitness it probably won't be optimized out because it is declared extern, but overallGeneration may be. If that is the case, it would explain the performance drop after a "real" memory access was added to the critical area.

NOTE2

I am still not sure that the explanation I provided fully accounts for such a significant performance drop, so I believe there might be some implementation details in the code which you didn't post (like volatile, for example).

EDIT

As Peter (@Peter) and Mark Lakata (@MarkLakata) stated in their separate answers, and I tend to agree with them, the most likely reason for the performance drop is that in the first case fitness was never used, so the compiler simply optimized the variable out together with the function call, while in the second case fitness was used, so the compiler could not. Good catch, Peter and Mark! I just missed that point.

Alex Lop.
  • "Please make sure that all the global/shared resources are defined as `volatile`." The mutex should take care of that already, it introduces memory barriers. – JAB Oct 29 '15 at 20:48
  • @JAB No, the mutex only protects the shared resource; the compiler may still, for example, keep that resource in a register and operate on it there instead of in memory. volatile tells the compiler not to apply such optimizations: every access to the variable must be performed on memory. – Alex Lop. Oct 29 '15 at 21:11
2

I realize this is not strictly an answer to the question, but an alternative to the problem as presented.

Is overallGeneration used while the code is running? That is, is it perhaps used to determine when to stop the computation? If it is not, you could forgo synchronizing a global counter and instead keep a counter per thread; once the computation is done, sum all the per-thread counters into a grand total. Similarly for overallFitness: keep track of the best fitness per thread and pick the best of the four results once computation is over.

Having no thread synchronization at all would get you 100% CPU usage.
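A minimal sketch of this per-thread scheme (evaluate_stub, the sequence setup, and all sizes are illustrative stand-ins for the OP's code):

#include <algorithm>
#include <climits>
#include <thread>
#include <vector>

// Hypothetical stand-in for EnergyFunction::evaluate().
int evaluate_stub(const std::vector<int>& seq) {
    int sum = 0;
    for (int v : seq) sum += v;
    return sum;
}

int main() {
    const int numThreads = 4;
    const long itersPerThread = 1000000;

    std::vector<int>  bestPerThread(numThreads, INT_MAX);
    std::vector<long> countPerThread(numThreads, 0);

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<int> sequence(64, 1);  // per-thread working copy
            int  best  = INT_MAX;              // thread-local, no locking needed
            long count = 0;
            for (long i = 0; i < itersPerThread; ++i) {
                // dummy mutation in place of the OP's dist(mt) logic
                sequence[static_cast<size_t>(i) % sequence.size()] = (i & 1) ? 1 : -1;
                best = std::min(best, evaluate_stub(sequence));
                ++count;
            }
            bestPerThread[t]  = best;          // one write per thread, at the end
            countPerThread[t] = count;
        });
    }
    for (auto& w : workers) w.join();

    // Combine per-thread results once, after all threads have finished.
    int  overallFitness    = *std::min_element(bestPerThread.begin(), bestPerThread.end());
    long overallGeneration = 0;
    for (long c : countPerThread) overallGeneration += c;
    (void)overallFitness;
    (void)overallGeneration;
}

Each thread touches only its own local variables inside the hot loop, so the only synchronization left is the final join.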

screwnut