What's the fastest way to divide an integer by 3?

Question

int x = n / 3;  // <-- make this faster

// for instance

int a = n * 3; // <-- normal integer multiplication

int b = (n << 1) + n; // <-- potentially faster multiplication

score 126 · Answer 1 · answered Oct 05 '08 at 02:27

126

The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:

$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$

The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.

answered Oct 05 '08 at 02:27

Martin Dorey

2,944
2
24
16

11

This is easier to understand if you just call it "the reciprocal, stored in fixed-point" – Ben Voigt Feb 04 '14 at 01:04
This isn't Montgomery division, It's more like the idea behind Barrett reduction. – Kyle Butt Sep 22 '17 at 03:01
This is by far the best answer, in my opinion. And I love how easy it is to choose the precision. For example, I needed one that worked on an 8-bit input. It was simple: (x * 0x56) >> 8; – All The Rage Feb 28 '20 at 19:27
@AlltheRage `(x * 0x56) >> 8` is not a good choice for 8 bit input. For example, 251*86/256 = 84 which doesn't sound right. – jpalecek Apr 12 '20 at 17:39
2

I guess I misspoke because my number range is really only 6 bits, but are stored in 8. To make it work for all 8 bit numbers, just choose more precision. This works, for example: (x * 0x556) >> 12 – All The Rage Apr 13 '20 at 20:08

score 60 · Accepted Answer · answered Oct 05 '08 at 01:30

60

This is the fastest as the compiler will optimize it if it can depending on the output processor.

int a;
int b;

a = some value;
b = a / 3;

answered Oct 05 '08 at 01:30

KPexEA

16,560
16
61
78

This is unfortunately what I thought as well. – Greg Dean Oct 05 '08 at 01:36
On second thought I think what you are tying to say is that the following. If "some value" is known ahead of time the compiler will optimize the evaluation of some value/3. However I am interested in the case where some value is determined at runtime – Greg Dean Oct 05 '08 at 02:09
3

turns out regardless of some value being known the compiler will optimize it to something similar to n * 0x55555556 >> 32 thanks – Greg Dean Oct 05 '08 at 02:43
1

The most straightforward way is usually the best, isn't it? – mxg Oct 05 '08 at 04:55
2

If `a` is a signed type, but is known to be positive, `(unsigned)a/3` will likely be faster because when dividing a signed type the compiler will add extra code to ensure that negative values yield a truncate-toward-zero result rather than the naturally-computed floored division result. – supercat Nov 13 '13 at 21:44
8

I feel like answers such as "let the compiler do it" are too flippant because in this age most of the people who would ask a question like this are compiler writers, hardware designers, or people who have found that the compiler did a crappy job. Basically we should have respect for the people who post questions on here. They generally need an answer with lots of thinking behind it. (Although I acknowledge that giving all of the potentially good answers, including the flippant, can be even more helpful.) – All The Rage Feb 28 '20 at 00:22

Aaron Murgatroyd · Answer 3 · 2012-07-30T11:44:27.600

There is a faster way to do it if you know the ranges of the values, for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it to the left by a power of 2 to that factor divided by 3.

eg.

Range 0 -> 768

you could use shifting of 10 bits, which multiplying by 1024, you want to divide by 3 so your multiplier should be 1024 / 3 = 341,

so you can now use (x * 341) >> 10
(Make sure the shift is a signed shift if using signed integers), also make sure the shift is an actually shift and not a bit ROLL

This will effectively divide the value 3, and will run at about 1.6 times the speed as a natural divide by 3 on a standard x86 / x64 CPU.

Of course the only reason you can make this optimization when the compiler cant is because the compiler does not know the maximum range of X and therefore cannot make this determination, but you as the programmer can.

Sometime it may even be more beneficial to move the value into a larger value and then do the same thing, ie. if you have an int of full range you could make it an 64-bit value and then do the multiply and shift instead of dividing by 3.

I had to do this recently to speed up image processing, i needed to find the average of 3 color channels, each color channel with a byte range (0 - 255). red green and blue.

At first i just simply used:

avg = (r + g + b) / 3;

(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)

After millions of iterations the entire operation took 36 milliseconds.

I changed the line to:

avg = (r + g + b) * 341 >> 10;

And that took it down to 22 milliseconds, its amazing what can be done with a little ingenuity.

This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.

Thank you for your thorough explanation. I was missing the point up until now. — Timo, Jul 02 '15 at 19:05
Nice! I had a divide by 5 problem that I was anal about - averaging rgba values of a pixel and it's neighbors to north south west and east (poor man's wicked fast blur). I settled for a good approximation using the proportion 50/256 which is ~0.195. It's beautiful, you can count the clock cycles. `p[i] = ((a << 5) + (a << 4) + (a << 1) + a) >> 8;` — Nolo, Mar 06 '16 at 09:17
This doesn't work. Multiplying 3 by 341 gives 1023. Right-shifting that 10 bits gives you zero, not the one you'd expect from `3 / 3`. In fact, it causes an issue of all integral multiples of three. — paxdiablo, Jul 22 '22 at 07:32
good catch @paxdiablo, looks like this was not throughly tested. for numbers below 2048, you actually need to do `x * 683 >> 11`. multiplying by 342 and shifting by 10 would be closer than 341, but would also only work for values below 512. — Jeff Brower, Sep 15 '22 at 03:24

Jay · Answer 4 · 2008-10-05T01:38:56.833

13

See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.

Also relevant:

Optimizing integer divisions with Multiply Shift in C#

edited Oct 05 '08 at 01:38

answered Oct 05 '08 at 01:31

Jay

41,768
14
66
83

pretty cool stuff however, I should have specified that I am confined to something x86-ish – Greg Dean Oct 05 '08 at 01:37
10

I know it's a bit annoying to poke at an ancient post, but the link you gave is ***DEAD*** *(Aaaaaargh!)* (I mean the first one). – Hungry Blue Dev Sep 28 '14 at 12:28

Mecki · Answer 5 · 2011-06-15T18:55:19.323

9

Depending on your platform and depending on your C compiler, a native solution like just using

y = x / 3

Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.

For optimization it is important to have integer numbers of a known size. In C int has no known size (it can vary by platform and compiler!), so you are better using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that you C compiler knows about 64 bit integer numbers (NOTE: Even on a 32 bit CPU architecture most C compilers can handle 64 bit integers just fine):

static inline uint32_t divby3 (
    uint32_t divideMe
) {
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}

As crazy as this might sound, but the method above indeed does divide by 3. All it needs for doing so is a single 64 bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64 bit application this code will be a lot faster than in a 32 bit application (in a 32 bit application multiplying two 64 bit numbers take 3 multiplications and 3 additions on 32 bit values) - however, it might be still faster than a division on a 32 bit machine.

On the other hand, if your compiler is a very good one and knows the trick how to optimize integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely or expect that it will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.

Problem is that it only works for constant numbers, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases) and both is different depending on the number you want to divide by and both take too much CPU time to calculate them on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where one second more or less compile time plays hardly a role).

edited Jun 15 '11 at 18:55

answered Jan 12 '09 at 18:47

Mecki

125,244
33
244
253

1

I wish I could tell the compiler NOT to do that sort of thing sometimes. This makes for faster code, but also bigger code. – Matthias Wandel May 18 '09 at 22:47
+1. Not sure why this was downvoted. One idea: If you don't know the number you wish to divide by at compile time but you know it will be a small integer, you could always define a static array of these magic values and index into that. – j_random_hacker Jan 06 '11 at 05:33
8

Please don't do this, let the compiler do it. With a single 32-bit multiply the compiler will have the result in a register. That is, it will be in the mul overflow register EDX. So, your optimization isn't and you've now converted a single 32-bit multiply into a 64-bit multiply and a 64-bit shift. – Chris Hopman Jan 07 '11 at 06:06
@Chris: You're right, but I don't really see that you're in disagreement with the spirit of what Mecki's saying. If there are faster CPU-specific ways to grab the relevant 32 bits then prefer them of course, but there's no way to code that portably in C++. In fact a really smart compiler would translate Mecki's code into a 32-bit multiply and use EDX as you said (although such a smart compiler would of course know to do the same for a plain `x / 3`). – j_random_hacker Jan 08 '11 at 14:46
1

@Chris: There are those people that rely upon the compiler making their otherwise slow code fast and there are those, that try to make code fast, regardless of compiler. The first kind of people produce code that may fail horribly on some compilers and some platforms, the second one produces code that performs always good to very good, regardless of platform. The code I posted above does not produce a 64 bit multiply when using GCC on x86, in fact in only produces a 32 bit multiply (since a 32 bit multiply on x86 has a 64 bit result and GCC knows that). – Mecki Jan 10 '11 at 12:31
5

@Mecki: Actually, the latter tend to produce code that invokes undefined behavior and then stick their fingers in their ears when somebody tells them they're wrong. I'm not saying it's not sometimes worthwhile to attempt to write code that will be fast "even if your compiler sucks", but anyone doing so needs to have a thorough grasp of the C standard, undefined behavior, implementation-defined behavior, and what's valid and portable. – R.. GitHub STOP HELPING ICE Jun 09 '11 at 17:11
1

@R..: Where does it produce undefined behavior? Multiplying a 64 bit uint a 32 bit uint is defined, bit shifting a 64 bit uint is defined and casting from uint64 to uint32 is defined as well; at least in ISO-C. Is it because unsigned long long is not defined to be exactly 64 bit? Well, I'll fix that for you. Other than that, please show me one, only one ISO-C compiler where code above does not produce the desired result or where a simple / 3 produces significant faster code (any architecture and platform is accepted). – Mecki Jun 15 '11 at 18:54
2

Apologies; it was not my intent to suggest that your answer produces UB. It's perfectly correct. My comment was just that lots of people who "think they know better than the compiler" may know the asm they want the compiler to generate, but often they don't know the rules of C well enough to avoid undefined behavior and nonportable code. – R.. GitHub STOP HELPING ICE Jun 16 '11 at 01:46

score 5 · Answer 6 · answered Jan 28 '18 at 01:31

For 64 bit numbers:

uint64_t divBy3(uint64_t x)
{
    return x*12297829382473034411ULL;
}

However this isn't the truncating integer division you might expect. It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.

For example if you run it on for example 11, it returns 6148914691236517209. This looks like a garbage but it's in fact the correct answer: multiply it by 3 and you get back the 11!

If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.

Theory:

64 bit unsigned arithmetic is a modulo 2^64 arithmetic. This means for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse which you can use to multiply with instead of division. This magic number can be obtained by solving the 3*x + 2^64*y = 1 equation using the Extended Euclidean Algorithm.

Where do you obtain `3*x + 2^64*y = 1` and what is its generalized form? — AMDG, Jul 13 '21 at 21:01
@AMDG The equation `A*x + B*y = 1` has an unique integer solution if the greatest common divisor of A and B is 1. — Calmarius, Jul 13 '21 at 21:07
what is the relationship between the variables here, and the dividend and divisor in the general case of dividing two integers? — AMDG, Jul 13 '21 at 22:37

Steve Hanov · Answer 7 · 2009-05-13T18:15:16.100

4

What if you really don't want to multiply or divide? Here is is an approximation I just invented. It works because (x/3) = (x/4) + (x/12). But since (x/12) = (x/4) / 3 we just have to repeat the process until its good enough.

#include <stdio.h>

void main()
{
    int n = 1000;
    int a,b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
}

The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.

If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.

This technique is most useful in Indiana.

edited May 13 '09 at 18:15

answered May 13 '09 at 17:57

Steve Hanov

11,316
16
62
69

Pick 2^33 as the divisor and you end up with Mecki's answer – Raniz May 10 '17 at 12:33
unfortunately it won't work for multiples of 3 (it'll return 0 for n = 3 and 1 for n = 6). You need some more special checks on that – phuclv Jul 27 '18 at 06:08
[Hacker's delight have some rounding after the shift step](http://www.hackersdelight.org/divcMore.pdf). Unfortunate I can't understand how that works – phuclv Jul 29 '18 at 03:37

score 2 · Answer 8 · answered Oct 05 '08 at 01:19

2

I don't know if it's faster but if you want to use a bitwise operator to perform binary division you can use the shift and subtract method described at this page:

Set quotient to 0

Align leftmost digits in dividend and divisor

Repeat:

If that portion of the dividend above the divisor is greater than or equal to the divisor:

Then subtract divisor from that portion of the dividend and

Concatentate 1 to the right hand end of the quotient

Else concatentate 0 to the right hand end of the quotient

Shift the divisor one place right

Until dividend is less than the divisor:

quotient is correct, dividend is remainder

STOP

answered Oct 05 '08 at 01:19

Mark Cidade

98,437
31
224
236

I suspect that this is not faster. It is more or less a algorithm for doing binary division. – Greg Dean Oct 05 '08 at 01:22
3

I finally found it! Years and years ago I came across a 6502 assembly routine for division and somehow lost it over time. 6502 has no mul/div opcodes, so this is the only way. Now I know! Thanks. – spoulson Oct 07 '08 at 14:12

Carlo V. Dango · Answer 9 · 2012-05-20T14:38:42.893

For really large integer division (e.g. numbers bigger than 64bit) you can represent your number as an int[] and perform division quite fast by taking two digits at a time and divide them by 3. The remainder will be part of the next two digits and so forth.

eg. 11004 / 3 you say

11/3 = 3, remaineder = 2 (from 11-3*3)

20/3 = 6, remainder = 2 (from 20-6*3)

24/3 = 8, remainder = 0

hence the result 3668

internal static List<int> Div3(int[] a)
{
  int remainder = 0;
  var res = new List<int>();
  for (int i = 0; i < a.Length; i++)
  {
    var val = remainder + a[i];
    var div = val/3;

    remainder = 10*(val%3);
    if (div > 9)
    {
      res.Add(div/10);
      res.Add(div%10);
    }
    else
      res.Add(div);
  }
  if (res[0] == 0) res.RemoveAt(0);
  return res;
}

score 1 · Answer 10 · edited Jul 13 '21 at 06:28

1

If you really want to see this article on integer division, but it only has academic merit ... it would be an interesting application that actually needed to perform that benefited from that kind of trick.

edited Jul 13 '21 at 06:28

AMDG

1,118
1
13
29

answered Oct 05 '08 at 01:21

Rob Walker

46,588
15
99
136

Not sure what's supposed to be "academic" here. Pretty much every compiler optimizes divisions by constants with this or very similar techniques. – soc Jan 30 '15 at 22:34

score 0 · Answer 11 · answered May 09 '13 at 15:30

0

Easy computation ... at most n iterations where n is your number of bits:

uint8_t divideby3(uint8_t x)
{
  uint8_t answer =0;
  do
  {
    x>>=1;
    answer+=x;
    x=-x;
  }while(x);
  return answer;
}

answered May 09 '13 at 15:30

Albert Gonzalez

1

The cool thing here is that this is a naive hardware implementation for this problem. The less cool thing is that this basically cannot be faster then any hardware implementation, unless the hardware implementation of the CPU is trying. – Cheiron Aug 18 '17 at 12:09

score 0 · Answer 12 · answered Jul 17 '13 at 06:55

0

A lookup table approach would also be faster in some architectures.

uint8_t DivBy3LU(uint8_t u8Operand)
{
   uint8_t ai8Div3 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, ....];

   return ai8Div3[u8Operand];
}

answered Jul 17 '13 at 06:55

Luke

1

What's the fastest way to divide an integer by 3?

12 Answers12

Linked

Related