
The context is a function that needs to run roughly once per frame, and is therefore very performance-critical. The function contains a loop, with operations inside it.

private int MyFunction(int number)
{
    // Code
    for (int i = 0; i <= 10000; i++)
    {
        var value = i * number;
        var valuePow2 = value * value;

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

Now, because of mathematical properties, we know that (a * b)² is equal to a² * b²

So, it would be possible to make my function into this:

private int MyFunction(int number)
{
    // Code
    var numberPow2 = number * number;
    for (int i = 0; i <= 10000; i++)
    {
        var iPow2 = i * i;
        var valuePow2 = numberPow2 * iPow2;

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

Intuitively, this seems like it should be faster, since number² does not vary and is now calculated only once, outside the loop. At the very least, this would be much faster for a human to do, because the x² operation inside the loop is applied to a much smaller number.

What I am wondering is: in C#, when you use types like int, will multiplication actually be faster with smaller numbers?

For example, will 5 * 5 execute faster than 5000 * 5000?

If so, then the second version is better, even if only by a small margin, because half of the multiplications in the loop would then operate on smaller numbers.

But if, for a given data type, the time is constant, then the first version of the function is better: I do the same amount of multiplication inside the loop either way, but in the second version I do one extra multiplication before the loop starts.

I am aware that for all intents and purposes, the performance difference is negligible. The second version was suggested to me in a code review because the function is critical, and I can't find any documentation to support either view.
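
For reference, this is the kind of crude harness I could use to time the two versions myself (a minimal sketch using System.Diagnostics.Stopwatch; the Version1/Version2 names are made up, the sum accumulator just stands in for the "code which uses valuePow2", and a proper harness like BenchmarkDotNet would give more reliable numbers):

using System;
using System.Diagnostics;

static class MulTiming
{
    // Version 1: both multiplications inside the loop.
    static long Version1(int number)
    {
        long sum = 0;
        for (int i = 0; i <= 10000; i++)
        {
            var value = i * number;
            var valuePow2 = value * value;
            sum += valuePow2; // stand-in for the code that uses valuePow2
        }
        return sum;
    }

    // Version 2: number * number hoisted out of the loop.
    static long Version2(int number)
    {
        long sum = 0;
        var numberPow2 = number * number;
        for (int i = 0; i <= 10000; i++)
        {
            var iPow2 = i * i;
            var valuePow2 = numberPow2 * iPow2;
            sum += valuePow2;
        }
        return sum;
    }

    static void Main()
    {
        const int reps = 10000;
        long sink = 0;

        var sw = Stopwatch.StartNew();
        for (int r = 0; r < reps; r++) sink += Version1(3);
        Console.WriteLine($"Version1: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int r = 0; r < reps; r++) sink += Version2(3);
        Console.WriteLine($"Version2: {sw.ElapsedMilliseconds} ms (sink={sink})");
    }
}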

Kaito Kid
  • You can't optimize based on intuition, guesses, reviews, or documentation. Use a profiler to measure performance. – Dour High Arch Mar 14 '19 at 18:45
  • Why aren't you using `Math.Pow`? This is about as fast as it gets. – Zer0 Mar 14 '19 at 18:48
  • There should be no performance difference based on the magnitude of a number contained in a numeric type with basic operations. Fundamentally, your two loops are the same: they each have two multiplications. Consider using LINQPad to time small segments of code, but realize accurate timing isn't that simple. Also, without profiling, you may be optimizing the wrong code. – NetMage Mar 14 '19 at 19:04
  • @Zer0 Because from what I've seen, Math.Pow is considerably slower than straight multiplication when the power is known and is 2: https://stackoverflow.com/questions/936541/math-pow-vs-multiply-operator-performance – Kaito Kid Mar 14 '19 at 19:04
  • In your case, multiplication and `Math.Pow` aren't equivalent, because you aren't using `double`. `Math.Pow` is only implemented for doubles. You're using `int`. For you to use `Math.Pow`, you would first have to convert to double, then convert back when you're done. That's two or three conversion operations that do not need to be made with straight multiplication. –  Mar 14 '19 at 19:19
  • I believe that the speed would be the same in both of your cases because both would have the same time complexity. No matter the size of the **integer**, multiplication will always be the same speed. – Jake Stewart Mar 14 '19 at 19:34

2 Answers


For example, will 5 * 5 execute faster than 5000 * 5000?

For compile-time constants, 5 * x is cheaper than 5000 * x because the former can be done with lea eax, [rdi + rdi*4].
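
In C# terms, the JIT can apply the same trick to multiplies by small constants. A tiny sketch of the identity it exploits (illustrative only, not actual JIT output):

int x = 123;
int viaMul   = x * 5;         // the JIT may emit: lea eax, [rdi + rdi*4]
int viaShift = (x << 2) + x;  // x*4 + x: the same strength reduction by hand
Console.WriteLine(viaMul == viaShift); // True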

But for runtime variables, the only integer instruction with data-dependent performance is division. This applies on any mainstream CPU: pipelining is so important that even if some cases could run with lower latency, they typically don't because that makes scheduling harder. (You can't have the same execution unit produce 2 results in the same cycle; instead the CPU just wants to know that putting inputs in on one cycle will definitely result in the answer coming out 3 cycles later.)

(For FP, again only division and sqrt have data-dependent performance on normal CPUs.)

Code using integers or FP that has any data-dependent branching can be much slower if the branches go a different way. (e.g. branch prediction is "trained" on one sequence of jumps for a binary search; searching with another key will be slower because it will mispredict at least once.)

And for the record, suggestions to use Math.Pow instead of integer * are insane. Simply converting an integer to double and back is slower than multiplying by itself with integer multiply.
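
To make that cost visible, here is roughly what the Math.Pow route has to do for an int input, with the conversions written out by hand (an illustrative sketch, not something anyone should ship):

int x = 7;

double xd = x;                 // int -> double conversion
double pd = Math.Pow(xd, 2.0); // call into the floating-point pow routine
int viaPow = (int)pd;          // double -> int conversion

int viaMul = x * x;            // one cheap, fully pipelined integer multiply

Console.WriteLine(viaPow == viaMul); // True here, since 49 round-trips exactly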


Adam's answer links a benchmark that's looping over a big array, with auto-vectorization possible. SSE / AVX2 only has 32-bit integer multiply. And 64-bit takes more memory bandwidth. That's also why it shows speedups for 16 and 8-bit integers. So it finds c=a*b running at half speed on a Haswell CPU, but that does not apply to your loop case.

In scalar code, imul r64, r64 has identical performance to imul r32, r32 on Intel mainstream CPUs (since at least Nehalem), and on Ryzen (https://agner.org/optimize/). Both 1 uop, 3 cycle latency, 1/clock throughput.

It's only AMD Bulldozer-family, and Intel Atom and Silvermont, where 64-bit scalar multiply is slower. (Assuming 64-bit mode of course! In 32-bit mode, working with 64-bit integers is slower.)


Optimizing your loop

For a fixed value of number, instead of recalculating i*number, compilers can and will optimize this to inum += number. This is called a strength-reduction optimization, because addition is a "weaker" (slightly cheaper) operation than multiplication.

for(...) {
    var value = i * number;
    var valuePow2 = value * value;
}

can be compiled into asm that does something like

var value = 0;
for(...) {
    var valuePow2 = value * value;

    ...

    value += number;
}

You might try writing it by hand that way, in case the compiler isn't doing it for you.
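
Applied to the function from the question, the hand-strength-reduced version might look like this (a sketch; it keeps the same unchecked int math as the original):

private int MyFunction(int number)
{
    // Code
    var value = 0;                     // runs through 0, number, 2*number, ...
    for (int i = 0; i <= 10000; i++)
    {
        var valuePow2 = value * value; // the only multiply left in the loop

        // Some code which uses valuePow2 several times

        value += number;               // strength-reduced replacement for i * number
    }
    return 0; // Not actual line
}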

But integer multiplication is very cheap and fully pipelined on modern CPUs. It has slightly higher latency than add and can run on fewer ports (usually only 1-per-clock throughput instead of 4-per-clock for add), but you say you're doing significant work with valuePow2. That should let out-of-order execution hide the latency.


If you check the asm and the compiler is using a separate loop counter incrementing by 1, you could also try to hand-hold your compiler into optimizing the loop to use value as the loop counter.


var maxval = number * 10000;
for (var value = 0; value <= maxval; value += number) {
    var valuePow2 = value * value;

    ...
}

Be careful if number*10000 can overflow, if you need it to wrap correctly. In that case this loop would run far fewer iterations. (Unless number is so big that value += number also wraps...)
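
If you don't actually want wrapping, one way to sidestep the overflow (a sketch, keeping the 10000 bound from the question) is to compute the bound and walk the counter in 64-bit:

long maxval = (long)number * 10000;  // 64-bit multiply: can't silently wrap
for (long value = 0; value <= maxval; value += number)
{
    long valuePow2 = value * value;  // also widened, since value*value can exceed int

    // ...
}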

Peter Cordes

For a typical processor, multiplying two 32-bit integers takes the same number of cycles regardless of the values in those integers. Most current processors take nearly twice as long to multiply 64-bit integers as 32-bit ones.

I did notice a problem in both of your code samples. When you multiply two ints, the result is an int, and var infers the variable's type from that result, so valuePow2 will be an int. Since your loop goes up to 10000, if number is 5 or greater, valuePow2 will overflow.
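
You can see that wraparound directly (a minimal demonstration using the loop's upper bound):

int value = 10000 * 5;         // 50,000: fits comfortably in an int
int valuePow2 = value * value; // 2,500,000,000 > int.MaxValue: wraps
Console.WriteLine(valuePow2);  // prints -1794967296 in the default unchecked context

checked
{
    int boom = value * value;  // throws OverflowException instead of wrapping
}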

If you don't want to overflow your int, you could change your code to

private int MyFunction(int number)
{
    // Code
    for (int i = 0; i <= 10000; i++)
    {
        long value = (long)i * number;  // 64-bit multiplication (the cast forces it)
        long valuePow2 = value * value; // 64-bit multiplication

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

The modified code should be faster because it changes one of the 64-bit multiplications per iteration into a 32-bit multiplication:

private int MyFunction(int number)
{
    // Code
    long numberPow2 = (long)number * number; // 64-bit multiplication, done once
    for (int i = 0; i <= 10000; i++)
    {
        int iPow2 = i * i;                   // 32-bit multiplication
        long valuePow2 = numberPow2 * iPow2; // 64-bit multiplication

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

But the circuitry in the CPU and the optimizations of the compiler can change how many cycles this ends up taking. At the end of the day, you said it best:

I am aware that for all intents and purposes, the performance difference is negligible.

Adam
  • That is a very interesting point that you raise, with the potential of turning 64bit multiplication into 32 bits. Unfortunately, even though it works for the code in my question, it has been considerably simplified (and changed) for the question, and it isn't possible for my actual case. I will keep that information in mind though, because I may run into cases where it works in the future, or be able to modify one for it to work. – Kaito Kid Mar 14 '19 at 19:23
  • @KaitoKid: the benchmark that Adam linked is looping over a big array, with auto-vectorization possible. (And SSE / AVX2 only has 32-bit integer multiply. And 64-bit takes more memory bandwidth.) In scalar code, `imul r64, r64` has identical performance to `imul r32, r32` on Intel mainstream CPUs (since at least Nehalem), and on Ryzen (https://agner.org/optimize/). Both 1 uop, 3 cycle latency, 1/clock throughput. It's only AMD Bulldozer-family where 64-bit scalar multiply is slower. (Assuming 64-bit mode of course!) – Peter Cordes Mar 14 '19 at 23:29