38

What are the fastest divisibility tests? Say, given a little-endian architecture and a 32-bit signed integer: how can one very quickly test whether a number is divisible by 2, 3, 4, 5, ... up to 16?

WARNING: the given code is an EXAMPLE only; every line is independent! The obvious solution using the modulo operation is slow on many processors that lack DIV hardware (such as many ARMs). Some compilers also cannot make such optimizations (say, if the divisor is a function argument or depends on something else).

Divisible_by_1 = do();
Divisible_by_2 = if (!(number & 1)) do();
Divisible_by_3 = ?
Divisible_by_4 = ?
Divisible_by_5 = ?
Divisible_by_6 = ?
Divisible_by_7 = ?
Divisible_by_8 = ?
Divisible_by_9 = ?
Divisible_by_10 = ?
Divisible_by_11 = ?
Divisible_by_12 = ?
Divisible_by_13 = ?
Divisible_by_14 = ?
Divisible_by_15 = ?
Divisible_by_16 = if(!(number & 0x0000000F)) do();

and special cases:

Divisible_by_2k = if(!(number & (tk-1))) do();  //tk=2**k=(2*2*2*...) k times
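The 2^k special case can be spelled out as a small sketch (`divisible_by_pow2` is an illustrative name, not from the question):

```c
#include <stdbool.h>
#include <stdint.h>

/* Divisibility by 2^k: a number is divisible by 2^k exactly when
   its low k bits are all zero, so mask them off and compare to 0. */
static bool divisible_by_pow2(uint32_t n, unsigned k)
{
    uint32_t mask = (UINT32_C(1) << k) - 1;   /* 2^k - 1 */
    return (n & mask) == 0;
}
```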
psihodelia
  • 29,566
  • 35
  • 108
  • 157
  • 8
    Obviously, divisibility by 4, 8, 16 can be checked by (v & N) == 0, where N is 4, 8 and 16. – Catherine Aug 01 '11 at 09:34
  • can the divisibility be checked in any order? or it must be increasing 2 to 16? – Sherif elKhatib Aug 01 '11 at 09:38
  • 4
    I think it might be possible to be better off than just using a modulo == 0 check. But it is really hard, if not impossible, to be sure that some solution actually is faster - especially if the claim has to hold on different systems / CPUs. Especially if you have a construct n % CONST == 0, why shouldn't a compiler be able to detect the very best way on your particular architecture? – b.buchhold Aug 01 '11 at 09:39
  • @b.buchhold On any (binary) CPU the logical AND will be faster than division. – Catherine Aug 01 '11 at 09:41
  • @Sherif: any order of course ! – psihodelia Aug 01 '11 at 09:44
  • Then 16 will be also divisible by 8,4,2 and 1. Do you want to run do() for each one? – graham.reeds Aug 01 '11 at 09:47
  • 1
    Also how long will do() take to run? Does this need to optimised? Have you profiled and is this the slowest part of your app? – graham.reeds Aug 01 '11 at 09:50
  • @graham.reeds: I need general methods. Presented code is example only. – psihodelia Aug 01 '11 at 09:51
  • @starblue: Sometimes I wanna check whether a number is divisible by 3, sometimes by 13, etc. – psihodelia Aug 01 '11 at 10:02
  • 12
    Without 1) **precise** program and instruction **workflow** 2) a **strong** indication that you have been profiling your program and **proven** that modulo is not fast enough for your needs, I vote to close as non constructive. Bitching about "and is faster than modulo" etc without **compiler generated assembly listings** and **strong profiling results** is absolutely non constructive. – Alexandre C. Aug 01 '11 at 10:06
  • 3
    @Alexandre: attempting to call % operation with any odd argument will generate a call to GCC EABI function, which is about 20-70 times slower than any logic operation (ARM9 processor). – psihodelia Aug 01 '11 at 10:12
  • @psihodelia: does it matter for your *particular* problem ? Potential interesting answers will heavily depend on the program at hand and the *exact* architecture involved. Also, writing % in your program doesn't necessarily imply a modulo CPU instruction. Writing `x % 2 == 0` will likely be transformed into `x & 1`, whichever is faster on your target architecture. – Alexandre C. Aug 01 '11 at 10:25
  • @Alexandre: yes, this question is platform- and compiler- independent! – psihodelia Aug 01 '11 at 10:25
  • -1 Again, what do you need the divisibility tests for? – starblue Aug 01 '11 at 10:57
  • 4
    @starblue: I am implementing a special tricky Fast Fourier Transform and I am interested in fastest possible divisibility tests (I work with C compilers and assemblers) – psihodelia Aug 01 '11 at 11:01
  • Have you considered rearranging your algorithm so you don't need to compute the remainder? For example by weakening, i.e. replacing it by increment and conditional subtraction of the modulus in a loop? – starblue Aug 01 '11 at 11:30
  • 5
    @Alexandre C: your choice of language, hasty conclusions and "don't optimize" attitude are the "non constructive" components here. – Olof Forshell Aug 02 '11 at 13:17
  • 1
    @Olof: trying to optimize arithmetic operations without knowing precisely the target architecture is *stupid*. Moreover, we still don't have a proof that the modulus here is the bottleneck. When writing a FFT routine, my bet (I'd like actual figures though) is that cache locality problems are more important performancewise than a few cycles spent in pointlessly trying to teach a compiler old tricks. The positive thing about this is that the greatest answers here are "don't optimize, let the compiler do". – Alexandre C. Aug 02 '11 at 13:21
  • 2
    @Alexandre C: as psihodelia puts it "attempting to call % operation with any odd argument will generate a call to GCC EABI function" so there has obviously been some "instruction workflow" analysis. Also the speed of a division instruction (if one exists) on an ARM is likely to be as slow as if not slower than a round trip to RAM. The stated EABI function execution time is an indicator of this. – Olof Forshell Aug 02 '11 at 13:30
  • 2
    @Olof: We still don't have **sample code**, target processor details and profiler output. Without this, nobody can answer the question precisely. Asking a useful question is all about *giving the necessary details*, not vague bitching about division performance (which will inevitably trigger the 'your compiler knows better' answers, which are **true** if no more detail is given). – Alexandre C. Aug 02 '11 at 13:36
  • @Olof: for instance, Mark Mann's answer is great, but may be useless depending on the actual code to optimize. I am not a never-optimize guy, but I learnt to optimize *actual working programs*, not ideas. – Alexandre C. Aug 02 '11 at 13:42
  • 2
@Alexandre C: it is possible to predict what general code a compiler will produce given a certain construct. "% v" where v is a variable will invariably generate a division instruction (or its equivalent) because the compiler cannot know what v will contain. Divisions are always very slow. BTW how do you propose posting "sample code, processor details and profiler output" in manageable form here? I read that you're actually saying "vote to close". – Olof Forshell Aug 02 '11 at 13:47
  • @Olof: "it is possible to predict what general code a compiler will produce given a certain construct". No it is not. It *may* have been the case 10 years ago though. Just actual sample code so that we could play a little with it would have been OK. In this form, the question cannot be answered satisfactorily. – Alexandre C. Aug 02 '11 at 13:57
  • 2
    @Alexandre C: I'm absolutely allergic to people with absolute points of view in something as fluid as, for instance, computer programming. I guess they are the ones that uphold status quo while others expand the envelope. As to code generation I wrote "general" i e did not mean "exact." – Olof Forshell Aug 03 '11 at 07:02
  • 1
    @whitequark did you mean (v & N-1)==0? – Deqing Sep 28 '13 at 10:20
  • 1
    @Deqing yes, `(v & (N-1)) == 0`. – Catherine Sep 29 '13 at 09:46
  • [This](http://stackoverflow.com/a/18750881/1237747) answer also may be helpful. – ST3 Oct 03 '13 at 10:14

16 Answers

41

In every case (including divisible by 2):

if (number % n == 0) do();

Anding with a mask of low order bits is just obfuscation, and with a modern compiler will not be any faster than writing the code in a readable fashion.

If you have to test all of the cases, you might improve performance by nesting some of the cases inside the test for another: there's no point in testing for divisibility by 4 if divisibility by 2 has already failed, for example.
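A minimal sketch of that nesting for the powers of two (returning a bitmask instead of calling the question's `do()` so the effect is observable; the function name is illustrative):

```c
#include <stdint.h>

/* Returns a mask with bit k set when n is divisible by 2^k (k = 1..4).
   Each inner test runs only if the weaker test passed: once n % 2 != 0,
   no later power of two needs checking. */
static unsigned pow2_divisors(uint32_t n)
{
    unsigned found = 0;
    if (n % 2 == 0) {
        found |= 1u << 1;
        if (n % 4 == 0) {
            found |= 1u << 2;
            if (n % 8 == 0) {
                found |= 1u << 3;
                if (n % 16 == 0)
                    found |= 1u << 4;
            }
        }
    }
    return found;
}
```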

James Kanze
  • 150,581
  • 18
  • 184
  • 329
  • @Ali Veli: and it's very slow, because % requires division operation, which is very slow on many CPUs – psihodelia Aug 01 '11 at 09:47
  • 3
    Your solution is very slow, because you implicitly use division operation ! – psihodelia Aug 01 '11 at 09:48
  • The compiler should optimize to a bit test for powers of two. – starblue Aug 01 '11 at 09:56
  • 11
    @psihodelia: Have you actually tried to check the compiler's generated assembly? – kennytm Aug 01 '11 at 09:56
  • I am not interested in sequential tests! Given code is EXAMPLE only! I am looking for methods of fast divisibility tests – psihodelia Aug 01 '11 at 09:59
  • 3
    @psihodelia Then there's not much you can do to improve on `number % n == 0`. – James Kanze Aug 01 '11 at 10:06
  • 23
@psihodelia My solution generates exactly the same machine code as yours, at least with g++ (and this is without optimization). From experience, trying to beat the compiler at this sort of thing is a losing proposition: the compiler knows more about the subtleties of your machine than you do, and will do a better job at finding the optimal machine instructions. Formulating the expression for something other than what you really want will inhibit the compiler in this, and sometimes result in worse code. – James Kanze Aug 01 '11 at 10:11
  • 1
    @James: please check what your compiler generates if you increment n in a loop – psihodelia Aug 01 '11 at 10:16
  • 18
@psihodelia If n is a variable, it will generate a division. Obviously, since it can't know what value to optimize for. On the other hand, I just wrote a function `template bool isDivisibleBy( int number )`, and instantiated it for all values between 2 and 16, and the compiler didn't generate a single division. (VC++ optimizes out the division for powers of 2, but not for other values.) – James Kanze Aug 01 '11 at 10:51
  • Anding with a mask of low order bits is much faster, as I have demonstrated on a similar question RE: divisible by two. IIRC it runs in about 65% of the time of modulo. http://stackoverflow.com/a/16369720/1899861 –  Oct 22 '14 at 06:24
  • 2
    @RocketRoy If this is the case, I'd suggest you change your compiler, or at least complain about it. Compilers have been changing multiplication, division and modulo into comparable shift, add and subtract operations for at least 20 years. – James Kanze Oct 22 '14 at 10:19
22

It is not a bad idea AT ALL to figure out alternatives to division instructions (which includes modulo on x86/x64) because they are very slow. Slower (or even much slower) than most people realize. Those suggesting "% n" where n is a variable are giving foolish advice because it will invariably lead to the use of the division instruction. On the other hand "% c" (where c is a constant) will allow the compiler to determine the best algorithm available in its repertoire. Sometimes it will be the division instruction but a lot of the time it won't.

In this document Torbjörn Granlund shows that the ratio of clock cycles required for unsigned 32-bit mults:divs is 4:26 (6.5x) on Sandybridge and 3:45 (15x) on K10. For 64-bit the respective ratios are 4:92 (23x) and 5:77 (15.4x).

The "L" columns denote latency and the "T" columns denote throughput, which has to do with the processor's ability to handle multiple instructions in parallel. Sandybridge can issue one 32-bit multiplication every other cycle, or one 64-bit multiplication every cycle. For K10 the corresponding throughput is reversed. For divisions, the K10 needs to complete the entire sequence before it may begin another. I suspect it is the same for Sandybridge.

Using the K10 as an example it means that during the cycles required for a 32-bit division (45) the same number (45) of multiplications can be issued and the next-to-last and last one of these will complete one and two clock cycles after the division has completed. A LOT of work can be performed in 45 multiplications.

It is also interesting to note that divs have become less efficient with the evolution from K8-K9 to K10: from 39 to 45 and 71 to 77 clock cycles for 32- and 64-bit.

Granlund's page at gmplib.org and at the Royal Institute of Technology in Stockholm contain more goodies, some of which have been incorporated into the gcc compiler.

Olof Forshell
  • 3,169
  • 22
  • 28
  • It's been awhile, but IIRC division on shorter integer types on x86 gets faster and faster. EG: an int_8 division is 9X faster than an int_32 division. Not even a little bit like proportional to the size, is it? Weird, but true. –  Oct 22 '14 at 06:28
  • @RocketRoy: On recent x86 microarchitectures like Sandybridge or Haswell with powerful high-radix dividers, integer division is only slightly faster for int8_t than int32_t. But `int64_t` is 2x to 3x slower than `int32_t`: On Haswell, latencies for `idiv r8`: 23-26. For `idiv r32`: 22-29 cycles, and for `idiv r64`: 39-103. (Worst-case throughput is also better for smaller registers). Even going back to Pentium II, there was only a 2x latency / 3x throughput difference between 8-bit vs. 32-bit. AMD Ryzen has 13-16 cycle `idiv r8`, and 14-30 cycle `idiv r32` (same best case, 2x worst case) – Peter Cordes Mar 14 '18 at 04:47
21

As @James mentioned, let the compiler simplify it for you. If n is a constant, any decent compiler is able to recognize the pattern and change it to a more efficient equivalent.

For example, the code

#include <stdio.h>

int main() {
    size_t x;
    scanf("%zu", &x);
    __asm__ volatile ("nop;nop;nop;nop;nop;");
    const char* volatile foo = (x%3 == 0) ? "yes" : "no";
    __asm__ volatile ("nop;nop;nop;nop;nop;");
    printf("%s\n", foo);
    return 0;
}

compiled with g++-4.5 -O3, the relevant part of x%3 == 0 will become

mov    rcx,QWORD PTR [rbp-0x8]   # rbp-0x8 = &x
mov    rdx,0xaaaaaaaaaaaaaaab
mov    rax,rcx
mul    rdx
lea    rax,"yes"
shr    rdx,1
lea    rdx,[rdx+rdx*2]
cmp    rcx,rdx
lea    rdx,"no"
cmovne rax,rdx
mov    QWORD PTR [rbp-0x10],rax

which, translated back to C code, means

(hi64bit(x * 0xaaaaaaaaaaaaaaab) / 2) * 3 == x ? "yes" : "no"
// equivalent to:                    x % 3 == 0 ? "yes" : "no"

no division involved here. (Note that 0xaaaaaaaaaaaaaaab == 0x20000000000000001L/3)
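The reconstructed expression can be verified by hand. Here is a sketch using the `unsigned __int128` GCC/Clang extension; the constant is the same magic multiplier that appears in the assembly above, and the function name is illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* 0xaaaaaaaaaaaaaaab == (2^65 + 1) / 3. The high 64 bits of the product,
   shifted right once more, give x / 3 exactly for every 64-bit x, so
   multiplying back and comparing tests divisibility without a division. */
static bool divisible_by_3_magic(uint64_t x)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)x * 0xaaaaaaaaaaaaaaabULL) >> 64);
    uint64_t q  = hi >> 1;          /* q == x / 3 */
    return q * 3 == x;
}
```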



ZachB
  • 13,051
  • 4
  • 61
  • 89
kennytm
  • 510,854
  • 105
  • 1,084
  • 1,005
17

Assume number is unsigned (32-bits). Then the following are very fast ways to compute divisibility up to 16. (I haven't measured but the assembly code indicates so.)

bool divisible_by_2 = number % 2 == 0;
bool divisible_by_3 = number * 2863311531u <= 1431655765u;
bool divisible_by_4 = number % 4 == 0;
bool divisible_by_5 = number * 3435973837u <= 858993459u;
bool divisible_by_6 = divisible_by_2 && divisible_by_3;
bool divisible_by_7 = number * 3067833783u <= 613566756u;
bool divisible_by_8 = number % 8 == 0;
bool divisible_by_9 = number * 954437177u <= 477218588u;
bool divisible_by_10 = divisible_by_2 && divisible_by_5;
bool divisible_by_11 = number * 3123612579u <= 390451572u;
bool divisible_by_12 = divisible_by_3 && divisible_by_4;
bool divisible_by_13 = number * 3303820997u <= 330382099u;
bool divisible_by_14 = divisible_by_2 && divisible_by_7;
bool divisible_by_15 = number * 4008636143u <= 286331153u;
bool divisible_by_16 = number % 16 == 0;

Regarding divisibility by d the following rules hold:

  • When d is a power of 2:

As pointed out by James Kanze, you can use is_divisible_by_d = (number % d == 0). Compilers are clever enough to implement this as (number & (d - 1)) == 0 which is very efficient but obfuscated.

However, when d is not a power of 2 it looks like the obfuscations shown above are more efficient than what current compilers do. (More on that later).

  • When d is odd:

The technique takes the form is_divisible_by_d = number * a <= b where a and b are cleverly obtained constants. Notice that all we need is 1 multiplication and 1 comparison:

  • When d is even but not a power of 2:

Then, write d = p * q where p is a power of 2 and q is odd, and use the "tongue in cheek" trick suggested by unpythonic, that is, is_divisible_by_d = is_divisible_by_p && is_divisible_by_q. Again, only 1 multiplication (in the calculation of is_divisible_by_q) is performed.

Many compilers (I've tested clang 5.0.0, gcc 7.3, icc 18 and msvc 19 using godbolt) replace number % d == 0 by (number / d) * d == number. They use a clever technique (see references in Olof Forshell's answer) to replace the division by a multiplication and a bit shift. They end up doing 2 multiplications. In contrast the techniques above perform only 1 multiplication.
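A sketch of how such (a, b) pairs can be derived for an odd divisor d, assuming the modular-inverse formulation from the cited reference: a is the inverse of d mod 2^32 (the Newton/Hensel iteration below is one standard way to compute it) and b = floor((2^32 - 1) / d). Function names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* Multiplicative inverse of an odd d modulo 2^32.
   inv = d is already correct mod 8; each iteration doubles
   the number of correct low bits, so 5 iterations cover 32 bits. */
static uint32_t inverse_mod_2_32(uint32_t d)
{
    uint32_t inv = d;
    for (int i = 0; i < 5; i++)
        inv *= 2 - d * inv;
    return inv;
}

/* d divides n exactly when n * a (mod 2^32) <= floor((2^32 - 1) / d). */
static bool divisible_odd(uint32_t n, uint32_t d)
{
    uint32_t a = inverse_mod_2_32(d);
    uint32_t b = UINT32_MAX / d;
    return n * a <= b;
}
```

For d = 3 this reproduces the answer's constants: a = 2863311531 and b = 1431655765.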

Update 01-Oct-2018

Looks like the algorithm above is coming to GCC soon (already in trunk):

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853

GCC's implementation seems even more efficient. Indeed, the implementation above has three parts: 1) divisibility by the divisor's even part; 2) divisibility by the divisor's odd part; 3) && to connect the results of the two previous steps. By using an assembler instruction which is not efficiently available in standard C++ (ror), GCC wraps up the three parts into a single one which is very similar to that of divisibility by the odd part. Great stuff! Having this implementation available, it's better (for both clarity and performance) to fall back to % in all cases.

Update 05-May-2020

My articles on the subject have been published:

Quick Modular Calculations (Part 1), Overload Journal 154, December 2019, pages 11-15.

Quick Modular Calculations (Part 2), Overload Journal 155, February 2020, pages 14-17.

Quick Modular Calculations (Part 3), Overload Journal 156, April 2020, pages 10-13.

bazzilic
  • 826
  • 2
  • 8
  • 22
Cassio Neri
  • 19,583
  • 7
  • 46
  • 68
  • Do this single-multiplication versions really work for *all* valid integer inputs? Put these in a loop for `number` from `0..UINT_MAX` and check the boolean results of your method against `number % 13 == 0`. e.g. [like this](https://stackoverflow.com/questions/5558492/divide-by-10-using-bit-shifts#comment80586900_5558614) for a related division hack which fails for large numbers. (If this does work for the full range, though, it's cool and compilers should be using it! Only needing the low half of a multiply result is a nice bonus on top of only needing one multiply) – Peter Cordes Mar 14 '18 at 05:21
  • 1
    @PeterCordes It does. The [reference](http://clomont.com/efficient-divisibility-testing/) mathematically proves it (kudos to Chris Lomont). In addition, before posting I've done the test that you suggested. Compilers definitely should be using. Notice that the constants above are for 32 bits unsigned integers . The same reference gives the constants for 64 bits unsigned integers and it explains how the constants are obtained. – Cassio Neri Mar 14 '18 at 09:52
  • That's awesome. Have you opened missed-optimization bug reports on gcc and clang? – Peter Cordes Mar 14 '18 at 09:55
  • @PeterCordes Good idea. Will do. – Cassio Neri Mar 14 '18 at 09:58
  • 1
    There was already a [bug report](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849) for gcc. – Cassio Neri Mar 14 '18 at 10:12
  • 1
    And [one](https://bugs.llvm.org/show_bug.cgi?id=35479) for clang as well. – Cassio Neri Mar 14 '18 at 10:18
  • 1
    It is possible to write a rotate in ISO C++ in a way that will compile to a hardware rotate instruction with good compilers. [Best practices for circular shift (rotate) operations in C++](https://stackoverflow.com/a/13732181). Anyway, very cool trick, thanks for writing up this answer with links to compiler bug reports. – Peter Cordes Oct 02 '18 at 01:01
  • @PeterCordes Indeed. I confess I've assumed compilers would not be that clever! I've updated the answer to strikethrough my previous claim. – Cassio Neri Oct 02 '18 at 15:22
The GCC bug is "RESOLVED FIXED" but how will we know when the fix has been deployed? – Gumby The Green Jun 17 '19 at 01:29
  • For me, the multiplication trick with odd #s only helps with 64-bit and 128-bit integers. With the latter, it's **way** faster than the mod operator. E.g., `n * 226854911280625642308916404954512140971 <= 113427455640312821154458202477256070485` is 17x faster than `n % 3 == 0` where all three integers are 128-bit (unsigned). – Gumby The Green Jun 17 '19 at 06:25
  • 2
    @GumbyTheGreen The implementation is in gcc 9.1. See [here](https://godbolt.org/z/31aCbP). Play with compiler versions and notice the difference in implementations (8.3 uses the "traditional" algorithm). Unfortunately, there are outstanding issues. (See my comment at the bottom of the [bug report](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849).) – Cassio Neri Jun 17 '19 at 19:12
15

A bit tongue in cheek, but assuming you get the rest of the answers:

Divisible_by_6  = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;
unpythonic
  • 4,020
  • 19
  • 20
8

First of all, I remind you that a number in the form bn...b2b1b0 in binary has value:

number = bn*2^n+...+b2*4+b1*2+b0

Now, when you say number%3, you have:

number%3 =3= bn*(2^n % 3)+...+b2*1+b1*2+b0

(I used =3= to indicate congruence modulo 3). Note also that b1*2 =3= -b1*1

Now I will write all 16 tests using + and - and possibly multiplication (note that a multiplication can be written as a shift, or as a sum of the same value shifted to different positions; for example, 5*x means x+(x<<2), in which you compute x only once).

Let's call the number n and let's say Divisible_by_i is a boolean value. As an intermediate value, imagine Congruence_by_i is a value congruent to n modulo i.

Also, let's say n0 means bit zero of n, n1 means bit 1, etc., that is

ni = (n >> i) & 1;

Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = n0-n1+n2-n3+n4-n5+n6-n7+n8-n9+n10-n11+n12-n13+n14-n15+n16-n17+n18-n19+n20-n21+n22-n23+n24-n25+n26-n27+n28-n29+n30-n31
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+2*n1-n2-2*n3+n4+2*n5-n6-2*n7+n8+2*n9-n10-2*n11+n12+2*n13-n14-2*n15+n16+2*n17-n18-2*n19+n20+2*n21-n22-2*n23+n24+2*n25-n26-2*n27+n28+2*n29-n30-2*n31
Congruence_by_7 = n0+2*n1+4*n2+n3+2*n4+4*n5+n6+2*n7+4*n8+n9+2*n10+4*n11+n12+2*n13+4*n14+n15+2*n16+4*n17+n18+2*n19+4*n20+n21+2*n22+4*n23+n24+2*n25+4*n26+n27+2*n28+4*n29+n30+2*n31
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+2*n1+4*n2-n3-2*n4-4*n5+n6+2*n7+4*n8-n9-2*n10-4*n11+n12+2*n13+4*n14-n15-2*n16-4*n17+n18+2*n19+4*n20-n21-2*n22-4*n23+n24+2*n25+4*n26-n27-2*n28-4*n29+n30+2*n31
Congruence_by_11 = n0+2*n1+4*n2+8*n3+5*n4-n5-2*n6-4*n7-8*n8-5*n9+n10+2*n11+4*n12+8*n13+5*n14-n15-2*n16-4*n17-8*n18-5*n19+n20+2*n21+4*n22+8*n23+5*n24-n25-2*n26-4*n27-8*n28-5*n29+n30+2*n31
Congruence_by_13 = n0+2*n1+4*n2+8*n3+3*n4+6*n5-n6-2*n7-4*n8-8*n9-3*n10-6*n11+n12+2*n13+4*n14+8*n15+3*n16+6*n17-n18-2*n19-4*n20-8*n21-3*n22-6*n23+n24+2*n25+4*n26+8*n27+3*n28+6*n29-n30-2*n31
Congruence_by_16 = n&0xF

Or when factorized:

Congruence_by_1 = 0
Congruence_by_2 = n&0x1
Congruence_by_3 = (n0+n2+n4+n6+n8+n10+n12+n14+n16+n18+n20+n22+n24+n26+n28+n30)-(n1+n3+n5+n7+n9+n11+n13+n15+n17+n19+n21+n23+n25+n27+n29+n31)
Congruence_by_4 = n&0x3
Congruence_by_5 = n0+n4+n8+n12+n16+n20+n24+n28-(n2+n6+n10+n14+n18+n22+n26+n30)+2*(n1+n5+n9+n13+n17+n21+n25+n29-(n3+n7+n11+n15+n19+n23+n27+n31))
Congruence_by_7 = n0+n3+n6+n9+n12+n15+n18+n21+n24+n27+n30+2*(n1+n4+n7+n10+n13+n16+n19+n22+n25+n28+n31)+4*(n2+n5+n8+n11+n14+n17+n20+n23+n26+n29)
Congruence_by_8 = n&0x7
Congruence_by_9 = n0+n6+n12+n18+n24+n30-(n3+n9+n15+n21+n27)+2*(n1+n7+n13+n19+n25+n31-(n4+n10+n16+n22+n28))+4*(n2+n8+n14+n20+n26-(n5+n11+n17+n23+n29))
// and so on

If a value ends up negative, add i to it until it becomes non-negative.

Now what you should do is recursively feed these values through the same process until Congruence_by_i becomes less than i (and obviously >= 0). This is similar to what we do when finding the remainder of a number modulo 3 or 9, remember? Sum up the digits; if the result has more than one digit, sum up the digits of the result again, until you get a single digit.

Now for i = 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 16:

Divisible_by_i = (Congruence_by_i == 0);

And for the rest:

Divisible_by_6 = Divisible_by_3 && Divisible_by_2;
Divisible_by_10 = Divisible_by_5 && Divisible_by_2;
Divisible_by_12 = Divisible_by_4 && Divisible_by_3;
Divisible_by_14 = Divisible_by_7 && Divisible_by_2;
Divisible_by_15 = Divisible_by_5 && Divisible_by_3;

Edit: Note that some of the additions could be avoided from the very beginning. For example n0+2*n1+4*n2 is the same as n&0x7, similarly n3+2*n4+4*n5 is (n>>3)&0x7 and thus with each formula, you don't have to get each bit individually, I wrote it like that for the sake of clarity and similarity in operation. To optimize each of the formulas, you should work on it yourself; group operands and factorize operation.
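One way the grouped Congruence_by_3 computation can be carried out in code is to fold the number two bits at a time, since 4 ≡ 1 (mod 3): summing base-4 digits preserves the value mod 3, just like the decimal digit-sum rule. A minimal sketch (the function name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Base-4 digit sum: each fold adds the low two bits into the
   running sum, which preserves the value mod 3 because 4 == 1 (mod 3).
   The loop terminates with a value in 0..3; 0 and 3 mean divisible. */
static bool divisible_by_3_fold(uint32_t n)
{
    while (n > 3)
        n = (n >> 2) + (n & 3);
    return n == 0 || n == 3;
}
```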

Shahbaz
  • 46,337
  • 19
  • 116
  • 182
7

The LCM of these numbers seems to be 720720. It's quite small, so you can perform a single modulus operation and use the remainder as an index into a precomputed LUT.
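A sketch of that LUT idea (names are illustrative; stored as `uint16_t` bitmasks, the table costs about 1.4 MB):

```c
#include <stdint.h>
#include <stdbool.h>

#define LCM_1_16 720720u   /* lcm(1..16) = 2^4 * 3^2 * 5 * 7 * 11 * 13 */

static uint16_t lut[LCM_1_16];   /* bit (d-2) set when d divides the residue */

static void init_lut(void)
{
    for (uint32_t r = 0; r < LCM_1_16; r++)
        for (uint32_t d = 2; d <= 16; d++)
            if (r % d == 0)
                lut[r] |= (uint16_t)(1u << (d - 2));
}

/* Because every d in 2..16 divides 720720, n % d == (n % 720720) % d,
   so one division answers all fifteen questions at once. */
static bool divisible_by(uint32_t n, uint32_t d)   /* 2 <= d <= 16 */
{
    return (lut[n % LCM_1_16] >> (d - 2)) & 1;
}
```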

ruslik
  • 14,714
  • 1
  • 39
  • 40
  • 3
    You only need the LCM of the odd primes: 15015. And there are only 5 primes, so the LUT doesn't need more than 5 bits. 75075 bits total. – MSalters Aug 02 '11 at 08:43
6

You should just use (i % N) == 0 as your test.

My compiler (a fairly old version of gcc) generated good code for all the cases I tried. Where bit tests were appropriate it did that. Where N was a constant it didn't generate the obvious "divide" for any case, it always used some "trick".

Just let the compiler generate the code for you, it will almost certainly know more about the architecture of the machine than you do :) And these are easy optimisations where you are unlikely to think up something better than the compiler does.

It's an interesting question though. I can't list the tricks used by the compiler for each constant as I have to compile on a different computer.. But I'll update this reply later on if nobody beats me to it :)

jcoder
  • 29,554
  • 19
  • 87
  • 130
5

This probably won't help you in code, but there's a neat trick which can help do this in your head in some cases:

For divide by 3: For a number represented in decimal, you can sum all the digits, and check if the sum is divisible by 3.

Example: 12345 => 1+2+3+4+5 = 15 => 1+5 = 6, which is divisible by 3 (3 x 4115 = 12345).

More interestingly the same technique works for all factors of X-1, where X is the base in which the number is represented. So for decimal number, you can check divide by 3 or 9. For hex, you can check divide by 3,5 or 15. And for octal numbers, you can check divide by 7.
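The octal case translates directly to binary code: folding three bits at a time sums the base-8 digits, which preserves the value mod 7 since 8 ≡ 1 (mod 7). A sketch (the function name is illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

/* Octal digit-sum rule: repeatedly add the low three bits into the
   shifted-down remainder. Ends with a value in 0..7; for n > 0 the
   result is 7 exactly when 7 divides n. */
static bool divisible_by_7_octal(uint32_t n)
{
    while (n > 7)
        n = (n >> 3) + (n & 7);
    return n == 0 || n == 7;
}
```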

Roddy
  • 66,617
  • 42
  • 165
  • 277
  • good idea, and good that you mention that this code is probably slower than modulo. – Mooing Duck Jan 25 '12 at 16:30
  • If you have a number as a string, the first adding digits can be very fast. (e.g. a few instructions on x86 with SSE2 `psadbw` to sum up to 16 digits). But doing it repeatedly down to a single digit requires modulo by 10 to break the binary integer into decimal digits, so you might as well just let the compiler use a magic-constant multiply to check for divisibility by 3 in the first place. But if your number is larger than a single register (e.g. int64_t on a 32-bit machine), and you already have a decimal string representation, this could be a win. – Peter Cordes Mar 14 '18 at 04:56
  • 1
    gcc doesn't use [the multiplicative-inverse trick](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi) for integers wider than a register, where it would take 4 multiplies and some `adc` to produce the high half of the full result. Instead it passes the constant to a libgcc division function that uses regular `div` instructions. – Peter Cordes Mar 14 '18 at 04:59
4

A method that can help with modulo reduction of arbitrary integer values uses bit-slicing and popcount.

mod3 = pop(x & 0x55555555) + (pop(x & 0xaaaaaaaa) << 1);  // <- one term is shared!
mod5 = pop(x & 0x99999999) + (pop(x & 0xaaaaaaaa) << 1) + (pop(x & 0x44444444) << 2);
mod7 = pop(x & 0x49249249) + (pop(x & 0x92492492) << 1) + (pop(x & 0x24924924) << 2);
modB = pop(x & 0x5d1745d1) + (pop(x & 0xba2e8ba2) << 1) +
       (pop(x & 0x294a5294) << 2) + (pop(x & 0x0681a068) << 3);
modD = pop(x & 0x91b91b91) + (pop(x & 0xb2cb2cb2) << 1) +
       (pop(x & 0x64a64a64) << 2) + (pop(x & 0xc85c85c8) << 3);

The maximum values for these variables are 48, 80, 73, 168 and 203, which all fit into 8-bit variables. The second round can be carried out in parallel (or some LUT method can be applied).

      mod3 mod3 mod5 mod5 mod5 mod7 mod7 mod7 modB modB modB modB modD modD modD modD
mask  0x55 0xaa 0x99 0xaa 0x44 0x49 0x92 0x24 0xd1 0xa2 0x94 0x68 0x91 0xb2 0x64 0xc8
shift  *1   *2   *1   *2   *4   *1   *2   *4   *1   *2   *4   *8   *1   *2   *4   *8
sum   <-------> <------------> <----------->  <-----------------> <----------------->
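A sketch of the mod-3 slice using the GCC/Clang `__builtin_popcount` builtin (the function name is illustrative; note the parentheses around the shifted terms, since in C `+` binds tighter than `<<`):

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit-slice mod 3: even-position bits weigh 1 and odd-position bits
   weigh 2 (mod 3), so two popcounts give a small value (<= 48)
   congruent to x mod 3. A base-4 digit fold finishes the reduction. */
static bool divisible_by_3_pop(uint32_t x)
{
    unsigned r = __builtin_popcount(x & 0x55555555u)
               + (__builtin_popcount(x & 0xaaaaaaaau) << 1);
    while (r > 3)
        r = (r >> 2) + (r & 3);
    return r == 0 || r == 3;
}
```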
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
4

In a previous question, I showed a fast algorithm to check in base N for divisors that are factors of N-1. Base transformations between different powers of 2 are trivial; that's just bit grouping.

Therefore, checking for 3 is easy in base 4; checking for 5 is easy in base 16, and checking for 7 (and 9) is easy in base 64.

Non-prime divisors are trivial, so only 11 and 13 are hard cases. For 11, you could use base 1024, but at that point it's not really efficient for small integers.

Community
  • 1
  • 1
MSalters
  • 173,980
  • 10
  • 155
  • 350
3

You can replace division by a non-power-of-two constant by a multiplication, essentially multiplying by the reciprocal of your divisor. The details to get the exact result by this method are complicated.

Hacker's Delight discusses this at length in chapter 10 (unfortunately not available online).

From the quotient you can get the modulus by another multiplication and a subtraction.
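A sketch for the divisor 5: the multiplier 0xCCCCCCCD is ceil(2^34 / 5), and this particular multiplier/shift pair happens to be exact for every 32-bit input (the general correctness conditions are what Hacker's Delight derives). The function name is illustrative:

```c
#include <stdint.h>

/* Division by the constant 5 via a fixed-point reciprocal:
   q = (n * ceil(2^34/5)) >> 34 equals n / 5 for all 32-bit n,
   and the remainder follows by one more multiply and a subtract. */
static uint32_t mod5(uint32_t n)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);  /* n / 5 */
    return n - q * 5;                                            /* n % 5 */
}
```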

starblue
  • 55,348
  • 14
  • 97
  • 151
  • Actually ... _that_ specific chapter of Hacker's Delight _is_ available online: http://www.hackersdelight.org/divcMore.pdf – FrankH. Aug 20 '12 at 13:47
  • @FrankH. Good find, but from the text it seems to be an addition of more material on this topic. – starblue Aug 21 '12 at 18:37
  • See also [Why does GCC use multiplication by a strange number in implementing integer division?](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi) for the details on how / why it works. Gcc will do it for you, except for `int64_t` on a 32-bit machine. (Or in general, with integers wider than a single register). – Peter Cordes Mar 14 '18 at 05:00
3

One thing to consider: since you only care about divisibility up to 16, you really only need to check divisibility by the primes up to 16. These are 2, 3, 5, 7, 11, and 13.

Divide your number by each of the primes, keeping track with a boolean (such as div2 = true). The numbers two and three are special cases. If div3 is true, try dividing by 3 again, setting div9. Two and its powers are very simple (note: '&' is one of the fastest things a processor can do):

if n & 1 == 0:
    div2 = true
    if n & 3 == 0: 
        div4 = true
        if n & 7 == 0: 
            div8 = true
            if n & 15 == 0:
                div16 = true

You now have the booleans div2, div3, div4, div5, div7, div8, div9, div11, div13, and div16. All other numbers are combinations; for instance div6 is the same as (div2 && div3)

So, you only need to do either 5 or 6 actual divisions (6 only if your number is divisible by 3).

For myself, I would probably use bits in a single register for my booleans; for instance bit_0 means div2. I can then use masks:

if (flags & (div2+div3)) == (div2 + div3): do_6()

note that div2+div3 can be a precomputed constant. If div2 is bit0, and div3 is bit1, then div2+div3 == 3. This makes the above 'if' optimize to:

if (flags & 3) == 3: do_6()

So now... mod without a divide:

def mod(n,m):
    i = 0
    while m <= n:
        m <<= 1
        i += 1
    while i > 0:
        m >>= 1
        if m <= n: n -= m
        i -= 1
    return n

div3 = mod(n,3) == 0
...

btw: the worst case for the above code is 31 times through either loop for a 32-bit number

FYI: Just looked at MSalters' post, above. His technique can be used instead of mod(...) for some of the primes.

Jim
  • 641
  • 4
  • 9
1

Fast tests for divisibility depend heavily on the base in which the number is represented. In the case when the base is 2, I think you can only do "fast tests" for divisibility by powers of 2. A binary number is divisible by 2^n iff the last n binary digits of that number are 0. For other tests I don't think you can generally find anything faster than %.

Armen Tsirunyan
  • 130,161
  • 59
  • 324
  • 434
0

A bit of evil, obfuscated bit-twiddling can get you divisibility by 15.

For a 32-bit unsigned number:

unsigned int mod_15ish(unsigned int x) {
  // returns a number between 0 and 21 that is either x % 15
  // or 15 + (x % 15), and returns 0 only for x == 0
  x = (x & 0xF0F0F0F) + ((x >> 4) & 0xF0F0F0F);
  x = (x & 0xFF00FF) + ((x >> 8) & 0xFF00FF);  
  x = (x & 0xFFFF) + ((x >> 16) & 0xFFFF);
  // *1
  x = (x & 0xF) + ((x >> 4) & 0xF);
  return x;
}

int Divisible_by_15(unsigned int x) {
  return ((x == 0) || (mod_15ish(x) == 15));
}

You can build similar divisibility routines for 3 and 5 based on mod_15ish.

If you have 64-bit unsigned ints to deal with, extend each constant above the *1 line in the obvious way, and add a line above the *1 line to do a right shift by 32 bits with a mask of 0xFFFFFFFF. (The last two lines can stay the same) mod_15ish then obeys the same basic contract, but the return value is now between 0 and 31. (so what's maintained is that x % 15 == mod_15ish(x) % 15)
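The 64-bit extension described above, written out as a sketch (the function name is illustrative; since the return value can now reach 31, test with `% 15 == 0` plus the x == 0 special case rather than `== 15`):

```c
#include <stdint.h>

/* 64-bit analogue of mod_15ish: each line sums base-16^k digit groups,
   and 16 == 1 (mod 15), so every fold preserves the value mod 15.
   Returns a value in 0..31 congruent to x mod 15; 0 only for x == 0. */
static uint32_t mod_15ish_64(uint64_t x)
{
    x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    x = (x & 0x00FF00FF00FF00FFULL) + ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = (x & 0x0000FFFF0000FFFFULL) + ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = (x & 0xFFFFFFFFULL)         + ((x >> 32) & 0xFFFFFFFFULL);
    x = (x & 0xF) + ((x >> 4) & 0xF);
    return (uint32_t)x;
}
```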

Daniel Martin
  • 23,083
  • 6
  • 50
  • 70
-2

Here are some tips I haven't seen anyone else suggest yet:

One idea is to use a switch statement, or precompute some array. Then, any decent optimizer can simply index each case directly. For example:

// tests for (2,3,4,5,6,7)
switch (n % 8)
{
case 0: break;
case 1: break;
case 2: do(2); break;
case 3: do(3); break;
case 4: do(2); do(4); break;
case 5: do(5); break;
case 6: do(2); do(3); do(4); break;
case 7: do(7); break;
} 

Your application is a bit ambiguous, but you may only need to check the prime numbers less than n=16. This is because every other number up to 16 is a product of those primes. So for n=16, you might be able to get away with only checking 2, 3, 5, 7, 11 and 13 somehow. Just a thought.

Inverse
  • 4,408
  • 2
  • 26
  • 35
when you check 15, this algorithm says it's divisible by 2, 3, and 4, but not 5. This method won't work. – Mooing Duck Jan 25 '12 at 16:28
  • Testing `n%8 == 7` isn't the same thing as `n%7 == 0`. If it was, optimizing compilers would use a simple bitwise AND when compiling `n%7 == 0`. – Peter Cordes Mar 14 '18 at 05:07