
I've read in a document that you can replace the mod operation with a bitwise AND, like this:

Instead:

int Limit = Value % Range;

You do:

int Limit = Value & (Range-1);

But compilers still generate mod instructions, so my question is basically: why don't compilers use the more efficient approach if the two work the same?

user1010005

3 Answers

26

Um no... that only works when Range is a power of two.

For all other values, you still need the modulus % operator.

There are also some subtle (possibly implementation-defined) differences when working with negative numbers.


As a side note: Using the % operator is probably more readable too.

Mysticial
  • Thank you. I was using it recently and thought it was cool... And now I feel like an idiot ;) – user1010005 Apr 10 '12 at 22:55
  • 4
    Why is this attracting so many votes? (Hey I'm not complaining...) – Mysticial Apr 10 '12 at 23:09
  • @user1010005 I've seen more broken "optimized" code due to using `&` instead of modulo for power of 2s when the input could be negative than I want to imagine.. it's such a simple optimization just leave it to the compiler. @Mysticial Simple questions and answers always attract the highest number of upvotes in my experience (presumably because basically anyone visiting the site understands the answer on first glance and is hence more likely to upvote it). If you're after upvotes, answering complex SSE optimization questions is the worst possible way to get it ;) – Voo Apr 10 '12 at 23:10
  • @Voo: When the input can be negative, `&` gives the correct remainder and `%` gives the wrong result... I've actually seen more bugs in the other direction: using `%` when either `&` (or slow and ugly fixup code after the `%`) is needed. – R.. GitHub STOP HELPING ICE Apr 10 '12 at 23:12
  • 1
    @R.. Not if you want truncating division results (which obviously was the case otherwise using & wouldn't have been a bug, would it ;) ). Although I agree that I'd be much more happy with modulus or floor division as default for programming languages (and yes I've seen lots of indexing on ints with % too in c/java..). – Voo Apr 10 '12 at 23:13
  • 3
    I've never found a case where I wanted the result the `%` operator gives. Presumably there are some when you're using integers as fixed point and such, but the most common usage cases I find for the remainder are (1) breaking down/converting time representations, and (2) dividing objects into N equivalence classes, e.g. for a hash table or such. In both these cases, getting both positive and negative values (2N-1 possible values rather than N values) complicates correct handling. – R.. GitHub STOP HELPING ICE Apr 10 '12 at 23:17
  • @R.. : Is this not why unsigned types exist? – ildjarn Apr 10 '12 at 23:21
  • @ildjarn Some languages don't have unsigned types, and even in those that do, signed is the default. No, I agree that truncating division is not what you want in 99% of all cases, and I'm quite happy with e.g. Python not going that way. There are special cases (and they do happen), but as a default, modulo/floor div all the way (negative mods are rare, and they're the same otherwise, so who cares?). PS: Also, for those who care: it was some fixpoint computation rewritten from C in Java. – Voo Apr 10 '12 at 23:24
  • @Voo : This question is tagged `c` and `c++`, so I assumed we were limiting our discussion to those two languages. ;-] – ildjarn Apr 10 '12 at 23:26
  • @ildjarn Sorry I thought we were discussing that in general ;) – Voo Apr 10 '12 at 23:38
  • @ildjarn: unsigned types will not help in the examples I gave. For example if your initial quantity is a time (relative to some base point, e.g. the epoch) in nanoseconds, and you want to convert it to `struct timespec` format, it may be negative, and first converting it to an unsigned type will reduce it modulo `ULLONG_MAX+1` or similar. Since `ULLONG_MAX+1` is not a multiple of 1000000000, the modular reductions do not commute, and you get the wrong answer. – R.. GitHub STOP HELPING ICE Apr 11 '12 at 02:35
14

You can replace modulo with AND only when the divisor is a power of two. Using elementary math, you can compute a remainder without a modulo operator:

a = b % c;

can be done with

x = b / c;
a = b - (x * c);

Let's check this with an example:

25 % 7 = 4:
25 / 7 = 3 (integer math)
25 - (3 * 7) = 25 - 21 = 4

Which is how I have to do it on my calculator anyway, as I don't have a modulo operator.
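The same calculator method written as C (a sketch; the helper name is mine, and it assumes non-negative operands and a nonzero divisor):

```c
/* Remainder via truncating division: b - (b / c) * c. */
int remainder_by_division(int b, int c) {
    int q = b / c;      /* integer division: 25 / 7 == 3 */
    return b - q * c;   /* 25 - (3 * 7) == 4 */
}
```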

Note that

25 & (7-1) =
0x19 & 0x06 = 0x00

So your substitution does not work: it gives 0, not 4.

Not only do many processors lack a modulo instruction, many do not have a divide either. Check out the book Hacker's Delight.

WHY would you want modulo? If you have burned the hardware to make a divide, you might be willing to go that extra mile and add modulo as well. Most processors take your question to the next level: why implement a divide in hardware at all when it can be done in software? The answer to your question is that most processor families do not have a modulo, and many do not have a divide, because it is not worth the chip real estate, power consumed, etc. compared to the software solution. The software solution is less painful/costly/risky.

Now, I assume your question is not what the winning poster answered, but about the cases where Range is a power of two and the identity does work. First off, if Range is not known at compile time, then you have to do a subtract and an AND — two operations, and maybe an intermediate variable — which can be more costly than a single modulo, so the compiler would be in error to optimize the modulo into a subtract-and-AND. If Range is a power of two and is known at compile time, your better/fancier compilers will optimize.

There are also times, especially with a variable-word-length instruction set, where the smaller instruction encoding wins out over the larger one: it might be less painful to load Range and do a modulo than to load the mask (values of Range that satisfy your identity have a single bit set — 0x100, 0x40, 0x8000, etc. — while the mask Range-1 has many non-zero bits) and do the AND. The load-immediate plus modulo might be cheaper than the load-immediate plus AND, or the modulo-immediate might be cheaper than the AND-immediate. You have to examine the instruction set and how the compiler has implemented the solution.
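As a sketch of the compile-time-known case: with an unsigned operand and a power-of-two constant, mainstream compilers will typically emit the AND themselves, so the two functions below usually compile to the same code (the function names are mine):

```c
/* Compilers commonly reduce this to an AND, because the divisor is a
   power-of-two constant and the operand is unsigned: */
unsigned mod_by_constant(unsigned x) { return x % 8u; }

/* The hand-written equivalent: */
unsigned and_by_mask(unsigned x) { return x & 7u; }
```

Checking the emitted assembly for both (e.g. with `-O2`) is the reliable way to see what your compiler actually does.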

I suggest you post some examples where the compiler is not doing the optimization, and I assume we can post many examples where the compiler has done the optimization you were expecting.

old_timer
  • Wow, that was a great explanation. Thank you very much! I'm picking this as best answer D: – user1010005 Apr 11 '12 at 06:47
  • 1
    Another comment might be that dont assume the modulo instruction is more costly than the and instruction. In the old days yes a divide with modulo was many many more clocks, today that is not necessarily the case. it is not hard to code the solution both ways let the compiler at it and see what the differences in the instrucitons produced, that might show you why the compiler did not do it. – old_timer Apr 12 '12 at 01:37
  • 1
    another thing to think about is the divide/modulo say for example the processor has a 32 bit divide that produces a 16 bit result and a 16 bit remainder. A divide by 2 or divide by 4 or a lot of numbers would not work, the divide wouldnt fit, at compile time the compiler should know that. Likewise a modulo 0x40000000 would not work either so an and would be better no matter how many instructions it costs (Because it would implement correctly). So even if the instructions might be (not saying they are) better to use the modulo it might not sometimes. – old_timer Apr 12 '12 at 01:39
  • Pay it forward: wander around Stack Overflow or other sites and share your knowledge with others. – old_timer Apr 13 '12 at 14:35
0

As others have stated, the range has to be a power of two (so the mask is 2^n - 1), and even then, if the range is only known at run time, you have problems.

On recent architectures (say, anything after the P4 era), the latency of an integer division instruction is between 26 and 50 or so cycles worst case. A multiply, by comparison, can be 1-3 cycles and can often be parallelized much better.

The x86 DIV instruction returns the quotient in EAX and the remainder in EDX, so the "remainder" is free (the modulus is the remainder).

If you implement something where the range is variable at run time and you wish to use &, you have to:

a) check whether the range is a power of two; if so, take your & code path — which adds a branch, a possible cache miss, etc., with huge latency potential

b) if it is not a power of two, use a DIV instruction

Using a DIV instead of adding a branch to the equation (a branch has the potential to cost hundreds or even thousands of cycles in bad cases with poor cache eviction) makes DIV the obvious best choice. On top of that, if you are using & with a signed data type, conversions will be necessary (there is no & for mixed data types, but there is a DIV for signed operands). In addition, if the DIV is only used to branch on the modulus and the rest of the results aren't used, speculative execution can perform nicely; the performance penalty is further mitigated by multiple pipelines that can execute instructions in parallel.
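The run-time dispatch this answer argues against would look roughly like this (a hypothetical sketch; the function name is mine):

```c
/* Only take the AND path when range is a nonzero power of two;
   otherwise fall back to %. The power-of-two test and its branch
   are exactly the extra cost described above. */
unsigned reduce(unsigned value, unsigned range) {
    if (range != 0 && (range & (range - 1)) == 0)
        return value & (range - 1);  /* power-of-two fast path */
    return value % range;            /* general case: one DIV  */
}
```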

You have to remember that in real code, a lot of your cache will be filled with the data you are working on, plus other code and data you will be working with soon or have just worked on. You really don't want to be evicting cache pages and waiting for them to page back in because of branch mispredictions. In most cases with modulo, you are not just doing i = 7; d = i % 4; you're in larger code that often calls a subroutine, which itself sits directly after a (predicted and cached) subroutine call. In addition, you're probably doing it in a loop, which is also using branch prediction; nested branch predictions with loops are handled pretty well in modern microprocessors, but it ends up being plain stupid to add to the predicting the hardware is already trying to do.

So to summarize: using DIV makes more sense on modern processors for the general usage case; it is not really an "optimization" for a compiler to generate a run-time power-of-two special case, because of cache considerations and the other factors above. If you really, really need to fine-tune that integer divide, and your whole program depends on it, you will end up hard-coding the divisor to a power of two and writing the bitwise & logic yourself.

Finally, a bit of a rant: a dedicated ALU unit for integer divides could reduce the latency to around 6-8 cycles; it just takes up a relatively large die area, because the data path ends up being about 128 bits wide, and nobody has the real estate for it when integer DIVs work just fine as they are.

std''OrgnlDave