
The Short Version

In the following line:

aData[i] = aData[i] + ( aOn * sin( i ) );

If aOn is 0 or 1, does the processor actually perform the multiplication, or does it work out the result conditionally (0 when the multiplier is 0, the other operand when it is 1)?

The Long Version

I'm looking into algorithm performance consistency, which partly involves a look into the effect of Branch Prediction.

The hypothesis is that this code:

for ( i = 0; i < iNumSamples; i++ )
    aData[i] = aData[i] + ( aOn * sin( i ) );

will provide more stable performance than this code (where branch prediction may destabilise performance):

for ( i = 0; i < iNumSamples; i++ )
{
    if ( aOn )
        aData[i] = aData[i] + sin( i );
}

with aOn being either 0 or 1, and with another thread able to toggle it while the loop is executing.

The actual conditional calculation (the + sin( i ) in the example above) involves more processing, and the if condition must stay inside the loop (there is a multitude of conditions, not just the one shown in the example above; also, changes to aOn should take effect immediately, not once per loop).

Ignoring performance consistency, the performance trade-off between the two options comes down to the time it takes to execute the if statement versus the time it takes to perform a multiplication.

Regardless, it is easy to spot that if the processor did not perform the actual multiplication for values like 1 and 0, the first option could be a win-win solution (no branch prediction, better performance).
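
For reference, here is a minimal, self-contained sketch of the two variants being compared. The function names, the buffer arguments, and the use of std::atomic for the shared flag are illustrative assumptions; the original code does not specify how aOn is shared between threads.

#include <atomic>
#include <cmath>

// Shared 0/1 flag; another thread may toggle it at any time.
// std::atomic is assumed here only so the cross-thread toggle is well-defined.
std::atomic<int> aOn( 1 );

// Variant 1: branchless - the flag acts as a 0/1 coefficient.
void ProcessBranchless( float* aData, int iNumSamples )
{
    for ( int i = 0; i < iNumSamples; i++ )
        aData[i] = aData[i] + ( aOn.load( std::memory_order_relaxed ) * sin( i ) );
}

// Variant 2: branched - the work is only done when the flag is set.
void ProcessBranched( float* aData, int iNumSamples )
{
    for ( int i = 0; i < iNumSamples; i++ )
    {
        if ( aOn.load( std::memory_order_relaxed ) )
            aData[i] = aData[i] + sin( i );
    }
}

Either form reads aOn once per iteration, so a toggle by another thread takes effect on the next iteration rather than only on the next pass of the whole loop.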

  • It's likely that the compiler would optimize those out. – Reinstate Monica -- notmaynard Jul 08 '13 at 22:27
  • I have checked with and without optimisation and it makes no difference to the relative performance between multipliers of `0`, `1`, and any other number (optimisation does improve performance slightly in all cases, though). So how exactly does optimisation relate to whether or not the processor performs the multiplication? – Izhaki Jul 08 '13 at 22:30
  • @iamnotmaynard The compiler cannot optimize it out unless aOn is a constant, but the OP has not given any indication that it is. – Jim Balter Jul 09 '13 at 01:57

1 Answer


Processors perform regular multiplication with 0s and 1s.

The reason is that if the processor checked for 0 and 1 before each calculation, the extra condition would take more cycles. While you would gain performance for multipliers of 0 and 1, you would lose performance for any other value (which is far more probable).

A simple program can prove this:

#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <ctime>    // clock, clock_t

void Loop( float aCoefficient )
{
    float iSum = 0.0f;

    clock_t iStart, iEnd;

    iStart = clock();
    for ( int i = 0; i < 100000000; i++ )
    {
        // rand() has side effects, so its call cannot be optimised away.
        iSum += aCoefficient * rand();
    }
    iEnd = clock();
    printf("Coefficient: %f: %li clock ticks\n", aCoefficient, iEnd - iStart );
}

int main(int argc, const char * argv[])
{
    Loop( 0.0f );
    Loop( 1.0f );
    Loop( 0.25f );

    return 0;
}

For which the output is:

Coefficient: 0.000000: 1380620 clock ticks
Coefficient: 1.000000: 1375345 clock ticks
Coefficient: 0.250000: 1374483 clock ticks 
  • What optimisation level did you use, what compiler, and did you try it with a print of the resulting `iSum` - more importantly, perhaps using something a little less intensive than `rand()`, which in itself does a bunch of rather complicated math and could be hiding the result (and since the compiler HAS to call `rand()` since `rand()` has side effects - it modifies internal state), it can't optimise that out, no matter what. – Mats Petersson Jul 08 '13 at 22:54
  • Great points. The relative performance remains the same regardless of the optimisation level (from none to `-Os`); The compiler is LLVM C++; without `rand()` printing `iSum` is needed to prevent the optimiser skipping the loop altogether; trying variations without `rand()`, for example replacing it with `sin( i )` still yields similar results for all 3 multipliers. – Izhaki Jul 08 '13 at 23:13
  • I think it's POSSIBLE to get a scenario where the compiler pre-calculates the 0.0 or 1.0 into "don't multiply", but at the same time, it's only possible where the compiler can see the value of the coefficient as a constant. And depending on the type of processor [and the predictability of the value during the run - in other words, how often it changes], it may or may not be better to have an `if` vs. a plain multiplication. ... continuing... – Mats Petersson Jul 08 '13 at 23:18
  • If a second thread is modifying `aOn`, technically, it should be protected so that the two threads can't update and read simultaneously. In practice, _and_ unless the precise detection of `on` or `off` is important, and the processor is an x86 (or selected other types), it may be OK not to do that. – Mats Petersson Jul 08 '13 at 23:20
  • The specifics of the actual implementation involve the GUI thread (ie, the user) changing `aOn` while another thread is doing the data processing. The loop is actually executed time and again for hours. The precise detection of `on` or `off` is not important on a single loop iteration scale (ie, it is OK to only update at i+1 - it is after all the highly imprecise GUI thread that triggers the change). So I can't see why a read/update lock is required. But changes must take effect within the loop and not per loop. – Izhaki Jul 08 '13 at 23:27
  • Although not discussed here, a third solution is to have a linked list of algorithms where `aOn` essentially adds or removes the algorithm from the list. But this solution is far more complicated, involves complex thread safety, and has no relationship to the original question. – Izhaki Jul 08 '13 at 23:30
  • Well, technically, one thread reading some data that is updated by another thread, and not protected by a lock of some sort, is undefined behaviour according to the C++ standard. Which is why I mentioned it. But as long as it's a processor that either only has one core, or has "sane" handling of cache content, it should work OK to update it in one thread and read it in another under the premise of "exact detection not necessary". (tbc) – Mats Petersson Jul 08 '13 at 23:38
  • But if the "on" or "off" is very rare, perhaps detecting it on the outside of the loop may be another option - in which case locks would work - and you could skip the loop altogether if it's "off". – Mats Petersson Jul 08 '13 at 23:39
  • "the introduction of the condition will take more cycles" -- modern CPUs execute a lot of logic in parallel. The CPU's multiplication logic may well check for operands of 0 and 1 and take shortcuts in those cases. That the machine you tested on doesn't do that doesn't mean that no machine does that. – Jim Balter Jul 09 '13 at 02:03