17

I was trying to create my own factorial function when I found that the calculation is twice as fast if it is done in pairs. Like this:

Groups of 1: 2*3*4 ... 50000*50001 = 4.1 seconds

Groups of 2: (2*3)*(4*5)*(6*7) ... (50000*50001) = 2.0 seconds

Groups of 3: (2*3*4)*(5*6*7) ... (49999*50000*50001) = 4.8 seconds

Here is the C# I used to test this.

Stopwatch timer = new Stopwatch();
timer.Start();

// Separate the calculation into groups of this size.
int k = 2;

BigInteger total = 1;

// Iterates from 2 up to (but not including) 50002, advancing 'i' by 'k' each time;
// the inner loop calculates the product of 'i' through 'i + k - 1', and multiplies 'total' by that result.
for (var i = 2; i < 50000 + 2; i += k)
{
    BigInteger partialTotal = 1;
    for (var j = 0; j < k; j++)
    {
        // Stop once the factor reaches 50000.
        if (i + j >= 50000) break;
        partialTotal *= i + j;
    }
    total *= partialTotal;
}

Console.WriteLine(timer.ElapsedMilliseconds / 1000.0 + "s");

I tested this with different group sizes and put the average times over a few runs in a bar graph. I expected it to become more efficient as I increased the group size, but 3 was the least efficient and 4 showed no improvement over groups of 1.

Bar plot showing the calculation time with different group sizes.

Link to First Data

Link to Second Data

What causes this difference, and is there an optimal way to calculate this?

Will Hamic
  • 672
  • 4
  • 20
  • How many times did you run the test for each group size? – Dan Puzey Aug 22 '16 at 19:23
  • Why not `for(int j = i; j < i + k; j++)` and then `partialTotal *= j;`? You could even calculate the sum of `i +k` into a variable to avoid doing it multiple times in the loop. – juharr Aug 22 '16 at 19:24
  • Quite possibly because by doing them in pairs you reduce loop overhead by 50%. Also, make sure that when you run your test you're doing it with a release build and that the debugger is not attached (i.e. Ctrl+F5 or "Debug -> Start without Debugging"). Running with the debugger attached is going to give you very skewed results. – Jim Mischel Aug 22 '16 at 19:25
  • Really not sure that criticizing the choice of algorithm is productive for the question at hand... :-) – Dan Puzey Aug 22 '16 at 19:27
  • @juharr I ran the test 3 times for each group size; there was little variation in the times it took to run, typically around 0.1 seconds. I am adding a link to a spreadsheet with data. – Will Hamic Aug 22 '16 at 19:28
  • 1
    I'd suspect that the results of multiplying pairs are in the range of `int` most of the time, while multiplying sequentially or in triples involves *big* numbers more often. – dejvuth Aug 22 '16 at 19:31
  • Good point: `int * int` is `int`, but once you hit the `BigInteger` values you're working with larger numbers. That doesn't explain why 2 specifically is so much faster... @WillHamic what happens if you rerun the test declaring `i` and `j` explicitly as `BigInteger` instead of using `var` (which will declare integers)? – Dan Puzey Aug 22 '16 at 19:34
  • I just ran this on my own machine and took it out to a group size k of 50. My results started like yours but were drastically different after k = 4. They kept decreasing and settled on a range of 1.00-1.10. Here are a few random samples from the figures: k = 1: 2.22, 2: 1.06, 3: 2.57, 4: 2.0, 7: 1.70, 10: 1.55, 20: 1.20, 30: 1.18, 40: 1.01, 50: 1.04 –  Aug 22 '16 at 19:35
  • On second thought, maybe my results aren't that different; I think my machine is just faster. I see the downward curve on your graph as well, even with the larger groups. –  Aug 22 '16 at 19:37
  • This is all over the board. I fiddled with the code more, and changed `50000` to `100000`. Results so far: k = 1: 11.14, 2: 14.307, 3: 11.39, 4: 11.18. Skip a few... 11: 6.1, 13: 6.0, 20: 5.32... **Note that k = 2 took the longest.** –  Aug 22 '16 at 19:42
  • 1
    It is worth noting that using `long` instead of `BigInteger` makes the performance measurable only in ticks... It seems like this is an implementation detail of `BigInteger` more than anything. – Dan Puzey Aug 22 '16 at 19:47
  • Well, it's not the debug/release thing. I get the same percentage difference in results regardless of debug/release and run with or without debugger attached. Likely @DanPuzey is right about it being an artifact of `BigInteger` implementation. – Jim Mischel Aug 22 '16 at 19:53
  • Where is Jon Skeet when you need him? –  Aug 22 '16 at 19:56
  • I just ran this on my own machine averaging over 50 iterations. `k = 2` is not the fastest on the list. I think it's reasonably clear that this is a result of poor benchmarking; voting to close. I'll rescind the vote if the test is run over a more indicative sample size and still shows markedly interesting results, but I find it unlikely. – Dan Puzey Aug 22 '16 at 20:05
  • You want speed go after parallel – paparazzo Aug 22 '16 at 20:07
  • @Paparazzi really not the point of the question. (Also, it's not like multiplying 50000 numbers together is really going to benefit from running multithreaded. More useful might be: if you want [more] speed, don't use `BigInteger`.) – Dan Puzey Aug 22 '16 at 20:12
  • @DanPuzey Really running on 4 cpu versus one is no help. Based on your comment I do believe the way you would try would not be faster. – paparazzo Aug 22 '16 at 20:17
  • @Paparazzi Running with `long` instead of `BigInteger` moves the problem from ~1second to under 1 millisecond on my machine. That's on a single core. You could parallelise using BigInteger and it might or might not make a difference, but this is really not answering the OP's question anyway. – Dan Puzey Aug 22 '16 at 20:21
  • @Paparazzi: I'd also note that, since you're multplying `total` by the value in each case, you're likely to add complexity to the code _and_ lose time to thread synchronisation in order to get a working solution. Really not sure it's as simple as you suggest! – Dan Puzey Aug 22 '16 at 20:23
  • @DanPuzey So long moves the problem - that has nothing to do with my comment. 4 cores are faster than 1 any day. And maybe it is not as difficult as you suggest. I've broken more complex sequences into parallel than this in about 20 lines. – paparazzo Aug 22 '16 at 20:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121570/discussion-between-dan-puzey-and-paparazzi). – Dan Puzey Aug 22 '16 at 20:26
  • @Paparazzi: I don't find your suggestion to be backed up by the facts: running parallel has pushed my execution time from 1-2 seconds up to 6+ seconds. More cores does not always mean faster processing. Show me an implementation that works and we could discuss from there. – Dan Puzey Aug 22 '16 at 20:28
  • @DanPuzey I've added a link to a google sheet with more data, which includes testing for different groups and different factorial levels. I am not sure how to analyze the data effectively. – Will Hamic Aug 22 '16 at 20:55

3 Answers

6

BigInteger has a fast case for numbers of 31 bits or less. When you do a pairwise multiplication, a specific fast path is taken that multiplies the values into a single ulong and sets the result directly:

public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2) {
  ...
  if (reg1._iuLast == 0) {
    if (reg2._iuLast == 0)
      Set((ulong)reg1._uSmall * reg2._uSmall);
    else {
      ...
    }
  }
  else if (reg2._iuLast == 0) {
    ...
  }
  else {
    ...
  }
}
public void Set(ulong uu) {
  uint uHi = NumericsHelpers.GetHi(uu);
  if (uHi == 0) {
    _uSmall = NumericsHelpers.GetLo(uu);
    _iuLast = 0;
  }
  else {
    SetSizeLazy(2);
    _rgu[0] = (uint)uu;
    _rgu[1] = uHi;
  }
  AssertValid(true);
}

A 100% predictable branch like this is perfect for a JIT, and this fast path should get optimized extremely well. It's possible that _rgu[0] and _rgu[1] are even inlined. This is extremely cheap, so it effectively cuts down the number of real operations by a factor of two.
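The size argument can be checked concretely. A quick sketch (Python here for brevity, since its integers are arbitrary-precision like BigInteger; the same arithmetic holds in C#): the benchmark's factors are at most about 16 bits, so a product of two of them still fits in a single 32-bit word, while a product of three spills into multiple words.

```python
# Factors in this benchmark are at most ~16 bits, so a product of two of
# them still fits in one 32-bit word (BigInteger's small-value
# representation), while a product of three needs more than one word.
pair = 50000 * 50001            # largest pairwise partial product
triple = 49999 * 50000 * 50001  # largest three-way partial product

print(pair.bit_length())    # 32: still a single 32-bit "digit"
print(triple.bit_length())  # 47: spills into a second 32-bit "digit"
```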

So why is a group of three so much slower? It's obvious that it should be slower than for k = 2; you have far fewer optimized multiplications. More interesting is why it's slower than k = 1. This is easily explained by the fact that the outer multiplication of total now hits the slow path. For k = 2 this impact is mitigated by halving the number of multiplies and the potential inlining of the array.

However, these factors do not help k = 3, and in fact the slow case hurts k = 3 a lot more. The second multiplication in the k = 3 case hits this branch:

  if (reg1._iuLast == 0) {
    ...
  }
  else if (reg2._iuLast == 0) {
    Load(ref reg1, 1);
    Mul(reg2._uSmall);
  }
  else {
    ...
  }

which allocates

  EnsureWritable(1);

  uint uCarry = 0;
  for (int iu = 0; iu <= _iuLast; iu++)
    uCarry = MulCarry(ref _rgu[iu], u, uCarry);

  if (uCarry != 0) {
    SetSizeKeep(_iuLast + 2, 0);
    _rgu[_iuLast] = uCarry;
  }

Why does this matter? Well, EnsureWritable(1) causes

uint[] rgu = new uint[_iuLast + 1 + cuExtra];

so rgu becomes length 3. The number of passes in total's code is decided in

public void Mul(ref BigIntegerBuilder reg1, ref BigIntegerBuilder reg2)

as

    for (int iu1 = 0; iu1 < cu1; iu1++) {
      ...
      for (int iu2 = 0; iu2 < cu2; iu2++, iuRes++)
        uCarry = AddMulCarry(ref _rgu[iuRes], uCur, rgu2[iu2], uCarry);
      ...
    }

which means that we have a total of len(total._rgu) * 3 operations. This hasn't saved us anything! There are only len(total._rgu) * 1 passes for k = 1 - we just do it three times!

There is actually an optimization on the outer loop that reduces this back down to len(total._rgu) * 2:

      uint uCur = rgu1[iu1];
      if (uCur == 0)
        continue;

However, they "optimize" this optimization in a way that hurts far more than before:

if (reg1.CuNonZero <= reg2.CuNonZero) {
  rgu1 = reg1._rgu; cu1 = reg1._iuLast + 1;
  rgu2 = reg2._rgu; cu2 = reg2._iuLast + 1;
}
else {
  rgu1 = reg2._rgu; cu1 = reg2._iuLast + 1;
  rgu2 = reg1._rgu; cu2 = reg1._iuLast + 1;
}

For k = 2, that causes the outer loop to be over total, since reg2 contains no zero values with high probability. This is great because total is way longer than partialTotal, so the fewer passes the better. For k = 3, the EnsureWritable(1) will always cause a spare space because the multiplication of three numbers no more than 15 bits long can never exceed 64 bits. This means that, although we still only do one pass over total for k = 2, we do two for k = 3!

This starts to explain why the speed increases again beyond k = 3: the number of passes per addition increases slower than the number of additions decreases, as you're only adding ~15 bits to the inner value each time. The inner multiplications are fast relative to the massive total multiplications, so the more time spent consolidating values, the more time saved in passes over total. Further, the optimization is less frequently a pessimization.

It also explains why odd group sizes take longer: they add an extra 32-bit integer to the _rgu array. This wouldn't happen so cleanly if the ~15 bits weren't so close to half of 32.


It's worth noting that there are a lot of ways to improve this code; the comments here are about why, not how to fix it. The easiest improvement would be to chuck the values in a heap and multiply only the two smallest values at a time.
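That heap idea can be sketched in a few lines (Python's heapq standing in for a priority queue; the same shape would port to C# with a priority queue of BigInteger values). It always multiplies the two smallest remaining values, so both operands of every multiplication stay as small as possible:

```python
import heapq

def factorial_heap(n):
    """Product of 2..n, always combining the two smallest values first."""
    if n < 2:
        return 1
    heap = list(range(2, n + 1))
    heapq.heapify(heap)  # min-heap of remaining factors/partial products
    while len(heap) > 1:
        a = heapq.heappop(heap)  # smallest remaining value
        b = heapq.heappop(heap)  # second smallest
        heapq.heappush(heap, a * b)
    return heap[0]

print(factorial_heap(10))  # 3628800
```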

Veedrac
  • 58,273
  • 15
  • 112
  • 169
4

The time required to do a BigInteger multiplication depends on the size of the product.

Both methods take the same number of multiplications, but if you multiply the factors in pairs, then the average size of the product is much smaller than it is if you multiply each factor with the product of all the smaller ones.

You can do even better if you always multiply the two smallest factors (original factors or intermediate products) that have yet to be multiplied together, until you get to the complete product.
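A common variant of this idea (not literally the two-smallest rule, but closely related) is a balanced product tree: split the range in half recursively, so that most multiplications involve operands of similar length. A sketch, in Python for brevity:

```python
def range_product(lo, hi):
    """Product of the integers lo..hi inclusive, split recursively so that
    each multiplication combines two sub-products of similar size."""
    if lo > hi:
        return 1
    if lo == hi:
        return lo
    mid = (lo + hi) // 2
    return range_product(lo, mid) * range_product(mid + 1, hi)

def factorial(n):
    return range_product(2, n) if n >= 2 else 1

print(factorial(10))  # 3628800
```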

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87
  • 1
    How does this explain the sharp dip in execution time where k = 2, though? –  Aug 22 '16 at 19:45
  • I think you messed up the cases for k > 2 :) – Matt Timmermans Aug 22 '16 at 19:46
  • 2
    So... the code looks OK to me, AND there is probably a change in the internal representation of a BigInteger when it gets bigger than 64 bits, which will happen with 3 numbers < 50000, but not 2. However, I would still expect multiplying in triples to be much faster than singles, so I suspect that there is something wrong with the way you're doing the benchmark. It's pretty hard to do little benchmarks properly in a complex JIT-compiled environment like the CLI or JVM. I could be wrong, though. There are other possibilities. Maybe the CLI is really smart when optimizing the inner loop. – Matt Timmermans Aug 22 '16 at 19:54
  • @WillHamic I agree with this answer (+1). Take a look at [Fast exact bigint factorial](http://stackoverflow.com/a/18333853/2521214) so you got something to compare with. – Spektre Aug 23 '16 at 08:49
-5

I think you have a bug ('+' instead of '*').

partialTotal *= i + j;

Good to check that you are getting the right answer, not just interesting performance metrics.

But I'm curious what motivated you to try this. If you do find a difference, I would expect it would have to do with optimalities in register and/or memory allocation. And I would expect it would be 0-30% or something like that, not 50%.

Mike Layton
  • 93
  • 1
  • 6
  • No, it's supposed to be `i + j`, as `i` will in this case be even numbers and `j` will be 0 and then 1 in the inner loop. – juharr Aug 22 '16 at 19:21
  • No, that is correct. And I did make sure all the results were the same. `i` is the current position of the outer loop, and `j` is the current position of the inner loop. So if it is set to groups of 4, and the outer loop is currently at 40, it would multiply (40+0)*(40+1)*(40+2)*(40+3). – Will Hamic Aug 22 '16 at 19:24
  • OK, great, but you completely ignored the other part of my answer. Note that your code might wind up as some kind of "vector" calculation that the compiler recognizes and optimizes with "multi-media" instructions or such. I'd be digging into the assembly to clarify, if it is important. My guess would be that the inner loop doesn't exist as such in the 2 case. (I don't recall you said what compiler/architecture/optimization so I'm left to speculate on that.) – Mike Layton Aug 22 '16 at 19:48