1

I would like to know the fastest way to do proportion calculation, that is y = x * a / b, where all values are 32 bit, unsigned, and a and b are fixed (initialized once, and then don't change) but not known at compile time. The result is guaranteed not to overflow (even thought the intermediate multiplication might need 64 bit). The programming language doesn't matter much, but Java would be best for my case. It needs to be as fast as possible (nanoseconds matter). I currently use:

int result = (int) ((long) x * a / b);

But division is slow. I know about Perform integer division using multiplication, so best would be a formula of type:

int result = (int) (((long) x * factor) >>> shift);

where factor and shift can be calculated from a and b (that calculation can be slow).

I tried to simply replace the division part of the original formula, but it doesn't work because the result of the two multiplications don't fit in 64 bit:

// init
int shift = 63 - Integer.numberOfLeadingZeros(b);
int factor = ((1L << shift) / b) + 1;
...
// actual calculation
int result = (int) ((long) x * a * factor) >>> shift);

The result doesn't actually have to be fully accurate in my case (off by one would be OK).

Community
  • 1
  • 1
Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
  • I have a (partial) solution now: `shift = 16; factor = ((long) a << 16) / b + 1;` - but it only works for some range of values. It would be nice to have a generic solution that works for all unsigned 32-bit values, if possible. – Thomas Mueller Oct 05 '16 at 13:05
  • Given that you know of integer division with invariant divisor, it is not clear where you are stuck. Given that `a, b, x` are all `uint32_t`, compute the unsigned `uint64_t` product `x*a`, then apply 64-bit division with constant divisor `b` to that. If `b` is a compile-time constant, simply use the division operator and let the compiler optimize it. Finally, cast the `uint64_t` quotient back to `uint32_t` (according to the specifications in the question, this is guaranteed to succeed by construction, no additional tests necessary). – njuffa Oct 05 '16 at 17:10
  • @njuffa Unfortunately the values are not known at compile time. I was struggling to find an elegant solution, with guaranteed behavior for edge cases (high values for either a, b, and x). I think I have it sorted out now. – Thomas Mueller Oct 06 '16 at 19:09

3 Answers3

2

What about

long a2 = a & 0xFFFFFFFFL;
long b2 = b & 0xFFFFFFFFL;
checkArgument(b2 > 0);
double dFactor = (double) a2 / b2;
shift = 0;
while (dFactor < 1L<<32) {
   dFactor *= 2;
   shift++;
}
factor = (long) dFactor;

for the preparation and

int result = (int) (((x & 0xFFFFFFFFL) * factor) >>> shift);

for the fast part? Now we have 2**32 <= factor < 2**33 and for any int x >= 0, the product x * factor < 2**31 * 2**33 = 2**64 just fits in an unsigned long. No bits get wasted. The conversion of dFactor to long rounds down, which may be suboptimal.


The preparation could surely be sped up, especially the loop can be eliminated by looking at the leading zeros first. I wouldn't bother with eliminating the double as it makes things simple.

maaartinus
  • 44,714
  • 32
  • 161
  • 320
  • Works all right (except for `a = 0`, in which case there is an endless loop). Thanks to your solution, I found a simpler solution with a fixed shift, and without using floating point. – Thomas Mueller Oct 06 '16 at 19:06
  • @ThomasMueller IMHO any fixed shift leads to lost precision. Eliminating `double` sounds easy, but I'm not sure if it's worth it as division is slow and using floating point doesn't change much. – maaartinus Oct 07 '16 at 13:35
0

I think it is not possible to always get the exact result if using the formula (x * factor) >>> shift: for some edge cases, either the result is 1 too low or 1 too high. To always get the correct result, the formula would need to be more complicated. I found a solution that does not require floating point, here a test case:

static final Set<Integer> SOME_VALUES = new TreeSet<Integer>();

static {
    Set<Integer> set = SOME_VALUES;
    for (int i = 0; i < 100; i++) {
        set.add(i);
    }
    set.add(Integer.MAX_VALUE);
    set.add(Integer.MAX_VALUE - 1);
    for (int i = 1; i > 0; i += i) {
        set.add(i - 1);
        set.add(i);
        set.add(i + 1);
    }
    for (int i = 1; i > 0; i *= 3) {
        set.add(i);
    }
    Random r = new Random(1);
    for (int i = 0; i < 100; i++) {
        set.add(r.nextInt(Integer.MAX_VALUE));
    }
}

private static void testMultiplyDelete() {
    for (int a : SOME_VALUES) {
        for (int b : SOME_VALUES) {
            if (b == 0) {
                continue;
            }
            int shift = 32;
            // sometimes 1 too low
            long factor = (1L << shift) * a / b;
            // sometimes 1 too high
            // long factor = ((1L << shift) * a / b) + 1;

            // sometimes 1 too low
            // double dFactor = (double) a / b;
            // int shift = 0;
            // while (dFactor > 0 && dFactor < (1L << 32)) {
            //     dFactor *= 2;
            //     shift++;
            // }
            // long factor = (long) dFactor;

            for (int x : SOME_VALUES) {
                long expectedResult = (long) x * a / b;
                if (expectedResult < 0 ||
                        expectedResult >= Integer.MAX_VALUE) {
                    continue;
                }
                int result = (int) ((x * factor) >>> shift);
                if (Math.abs(result - expectedResult) > 1) {
                    System.out.println(x + "*" + a + "/" + b +
                            "=" + expectedResult + "; " +
                            "(" + x + "*" + factor + ")>>>" + shift + "=" + result);
                }
            }
        }
    }
}
Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
-2

Since a and b are both fixed, you could just do the division once and reuse the result (this might already be happening automatically behind the scenes):

int c = a / b;
int y1 = x1 * c;
int y2 = x2 * c;
...

If you really need to optimize it, look into running it on the GPU (for example using java bindings for CUDA), which will allow you to parallelize the calculations, although this is a lot harder to implement.

Finally, it's always a good idea to add timers when testing so you can run benchmarks to make sure that the optimizations actually improve performance.

PandaConda
  • 3,396
  • 2
  • 20
  • 22
  • `int` isn't like a mathematical number - in particular, you can't do the division first and expect the same results. For example, if `a` is less than `b`, your code will always produce zero, rather than a correctly scaled quantity. – Toby Speight Jan 15 '18 at 17:00