Why is Math.DivRem so inefficient?

Question

In my computer this code takes 17 seconds (1000 millions times):

static void Main(string[] args) {
   var sw = new Stopwatch(); sw.Start();
   int r;
   for (int i = 1; i <= 100000000; i++) {
      for (int j = 1; j <= 10; j++) {
         MyDivRem (i,j, out r);
      }
   }
   Console.WriteLine(sw.ElapsedMilliseconds);
}

static int MyDivRem(int dividend, int divisor, out int remainder) {
   int quotient = dividend / divisor;
   remainder = dividend - divisor * quotient;
   return quotient;
}

while Math.DivRem takes 27 seconds.

.NET Reflector gives me this code for Math.DivRem:

public static int DivRem(int a, int b, out int result)
{
    result = a % b;
    return (a / b);
}

CIL

.method public hidebysig static int32 DivRem(int32 a, int32 b, [out] int32& result) cil managed
{
    .maxstack 8
    L_0000: ldarg.2
    L_0001: ldarg.0
    L_0002: ldarg.1
    L_0003: rem
    L_0004: stind.i4
    L_0005: ldarg.0
    L_0006: ldarg.1
    L_0007: div
    L_0008: ret
}

Theoretically it may be faster for computers with multiple cores, but in fact it shouldn't need to do two operations in the first place, because x86 CPUs return both the quotient and remainder when they do an integer division using DIV or IDIV (http://www.arl.wustl.edu/~lockwood/class/cs306/books/artofasm/Chapter_6/CH06-2.html#HEADING2-451)!

You can't tell that by looking at the IL. Rather, you need to see what the JIT compiler actually produces. — Mehrdad Afshari, Apr 26 '09 at 16:27
Omg, I can't believe this! Today I've noticed in my app that calling DivRem is slightly slower than simply doing / and %. Now I tested your DivRem function and it's indeed considerably faster than both! (~20% on my PC.) — michalburger1, Sep 01 '10 at 13:19
x64 is a superset of x86 and if the CPU is not compatible with x86 is just a matter of using different code for the .net framework for that CPU — ggf31416, Jan 15 '09 at 16:01
Right but .NET is also implemented as Mono and should therefore run on other archs such as ppc etc. — André, Jan 15 '09 at 20:41

score 20 · Accepted Answer · answered Jan 15 '09 at 16:06

20

Grrr. The only reason for this function to exist is to take advantage of the CPU instruction for this, and they didn't even do it!

answered Jan 15 '09 at 16:06

Joshua

40,822
8
72
132

1

Wouldn't you have to look at the IL to prove this? – Austin Salonen Jan 15 '09 at 16:08
2

No. If they did it that way .NET Reflector would tell you that this function is in the native stubs. – Joshua Jan 15 '09 at 17:01
(for the record, though, the jitter doesn't actually do this in .NET 4: the disassembly includes two `idiv`s, two `cdq`s and a whole bunch of `mov`s) – Roman Starkov Aug 03 '11 at 11:12
@romkyns: If the jitter could do it for DivRem, then it could do it for your code. – Joshua Jan 04 '13 at 20:07
4

Actually, the function could be useful even without a CPU function to yield remainder and quotient, if it started by computing the quotient, multiplied that by the divisor, and subtracted from the dividend. Multiplications are faster than divisions on almost every platform, and on some, there's a full order-of-magnitude difference in speed so using a divide, multiply, and subtract may be almost twice as fast as using two divides. – supercat Jan 31 '14 at 17:43
5

On a related note, though, I think it's unfortunate that so few languages provide any support for multiplies whose product size is larger than the operands, or divides whose dividend size is larger than the divisor. One can scale up the a multiply's operands to the result size, or scale up the divisor's size to match the dividend, but many processors have instructions for mixed-sized operands and it's a shame not to use them. – supercat Jan 31 '14 at 17:45
2

@supercat At least some versions of the .net jitter optimize the 32bit*32bit=>64 bit when you cast the operands from 32 to 64 bits immediately prior to the multiplication. No need for language support. On the other hand the 128 bit type required to hold the result of a 64 bit multiplication is sorely missing. – CodesInChaos Sep 16 '15 at 17:39
2

@CodesInChaos: Even if peephole optimizers can replace certain constructs with hardware intrinsics, unless a language standardizes certain forms which code generators are strongly encouraged to peephole-optimize, one may easily end up in a situation where the fastest way to write a piece of code on each of two "compatible" compilers ends up being much slower on the other. I dislike the idea that programmers should request operations they don't want in order to get operations they do want, in the hope that the optimizer will omit the operations the programmers never wanted in the first place. – supercat Sep 16 '15 at 18:21
I was looking much more simply. If a peephole optimizer can catch the form, it ought to be able to catch the very same form IN YOUR OWN CODE. – Joshua Sep 16 '15 at 19:23

Bob · Answer 2 · 2020-08-05T15:44:23.080

15

While .NET Framework 4.6.2 still uses the suboptimal modulo and divide, .NET Core (CoreCLR) currently replaces the divide with a subtract:

    public static int DivRem(int a, int b, out int result)
    {
        // TODO https://github.com/dotnet/runtime/issues/5213:
        // Restore to using % and / when the JIT is able to eliminate one of the idivs.
        // In the meantime, a * and - is measurably faster than an extra /.

        int div = a / b;
        result = a - (div * b);
        return div;
    }

And there's an open issue to either improve DivRem specifically (via intrinsic), or detect and optimise the general case in RyuJIT.

edited Aug 05 '20 at 15:44

answered Jan 23 '17 at 11:34

Bob

15,441
3
26
42

Why did he put parentheses around `div * b`? – Anton Shepelev Jan 05 '18 at 12:52
@jnm2: Well, I thought one should rely on operator precedence and use parentheses only when the intended precedence differs from that used in the language... – Anton Shepelev Feb 10 '18 at 11:12
@Ant_222 I disagree with that. The usual guidance is to optimize for human readers first. Anything that makes the code clearer is preferable. – jnm2 Feb 10 '18 at 14:33
@jnm2: But redundant parentheses produce clutter and thwart perception. Precedence is used to reduce the amount of parentheses. – Anton Shepelev Feb 11 '18 at 20:41
@Ant_222 Of course this is subjective and I don't see the things you're seeing, but I'm also a proponent of erroring on the side of clarity for all future readers. – jnm2 Feb 12 '18 at 00:42

score 14 · Answer 3 · answered Jan 15 '09 at 17:47

Wow, that really looks stupid, doesn't it?

The problem is that -- according to the Microsoft Press book ".NET IL Assembler" by Lidin -- the IL rem and div atithmetic instructions are exactly that: compute remainder and compute divisor.

All arithmetical operations except the negation operation take two operands from the stack and put the result on the stack.

Apparently, the way the IL assembler language is designed, it's not possible to have an IL instruction that produces two outputs and pushes them onto the eval stack. Given that limitation, you can't have a division instruction in IL assembler that computes both the way the x86 DIV or IDIV instructions do.

IL was designed for security, verifiability, and stability, NOT for performance. Anyone who has a compute-intensive application and is concerned primarily with performance will be using native code and not .NET.

I recently attended Supercomputing '08, and in one of the technical sessions, an evangelist for Microsoft Compute Server gave the rough rule of thumb that .NET was usually half the speed of native code -- which is exactly the case here!.

While this is true, there is no reason that `Math.DivRem` can't be implemented in the runtime and marked with `[MethodImpl(MethodImplOptions.InternalCall), SecuritySafeCritical]` like _many_ of the other methods on `System.Math` are. — codekaizen, Dec 28 '10 at 23:26
Do you have a source for the claim that IL is inherently unable to produce two values on the stack? I'm not saying it's false; it's easy to imagine the JITter making heavy use of this assumption, but a proper source would be handy. — Roman Starkov, Jan 04 '13 at 22:24
That's true, but DivRem could return a signle struct with Div and Rem fields. It seems to be faster anyway, and a method without `out` params should be better. Well I think it's not the question in .Net. — Alex Zhukovskiy, Feb 21 '15 at 13:52

score 3 · Answer 4 · answered Jan 15 '09 at 16:37

If I had to take a wild guess, I'd say that whoever implemented Math.DivRem had no idea that x86 processors are capable of doing it in a single instruction, so they wrote it as two operations. That's not necessarily a bad thing if the optimizer works correctly, though it is yet another indicator that low-level knowledge is sadly lacking in most programmers nowadays. I would expect the optimizer to collapse modulus and then divide operations into one instruction, and the people who write optimizers should know these sorts of low-level things...

just like the compiler or the JIT should replace Math.Pow(x,3) with x*x*x but that seems be asking too much... — ggf31416, Jan 15 '09 at 17:27

score 2 · Answer 5 · answered Jan 15 '09 at 16:00

2

The answer is probably that nobody has thought this a priority - it's good enough. The fact that this has not been fixed with any new version of the .NET Framework is an indicator of how rarely this is used - most likely, nobody has ever complained.

answered Jan 15 '09 at 16:00

Sander

25,685
3
53
85

codekaizen · Answer 6 · 2010-12-28T23:23:41.660

This is really just a comment, but I don't get enough room.

Here is some C# using Math.DivRem():

    [Fact]
    public void MathTest()
    {
        for (var i = 1; i <= 10; i++)
        {
            int remainder;
            var result = Math.DivRem(10, i, out remainder);
            // Use the values so they aren't optimized away
            Assert.True(result >= 0);
            Assert.True(remainder >= 0);
        }
    }

Here is the corresponding IL:

.method public hidebysig instance void MathTest() cil managed
{
    .custom instance void [xunit]Xunit.FactAttribute::.ctor()
    .maxstack 3
    .locals init (
        [0] int32 i,
        [1] int32 remainder,
        [2] int32 result)
    L_0000: ldc.i4.1 
    L_0001: stloc.0 
    L_0002: br.s L_002b
    L_0004: ldc.i4.s 10
    L_0006: ldloc.0 
    L_0007: ldloca.s remainder
    L_0009: call int32 [mscorlib]System.Math::DivRem(int32, int32, int32&)
    L_000e: stloc.2 
    L_000f: ldloc.2 
    L_0010: ldc.i4.0 
    L_0011: clt 
    L_0013: ldc.i4.0 
    L_0014: ceq 
    L_0016: call void [xunit]Xunit.Assert::True(bool)
    L_001b: ldloc.1 
    L_001c: ldc.i4.0 
    L_001d: clt 
    L_001f: ldc.i4.0 
    L_0020: ceq 
    L_0022: call void [xunit]Xunit.Assert::True(bool)
    L_0027: ldloc.0 
    L_0028: ldc.i4.1 
    L_0029: add 
    L_002a: stloc.0 
    L_002b: ldloc.0 
    L_002c: ldc.i4.s 10
    L_002e: ble.s L_0004
    L_0030: ret 
}

Here is the (relevant) optimized x86 assembly generated:

       for (var i = 1; i <= 10; i++)
00000000  push        ebp 
00000001  mov         ebp,esp 
00000003  push        esi 
00000004  push        eax 
00000005  xor         eax,eax 
00000007  mov         dword ptr [ebp-8],eax 
0000000a  mov         esi,1 
        {
            int remainder;
            var result = Math.DivRem(10, i, out remainder);
0000000f  mov         eax,0Ah 
00000014  cdq 
00000015  idiv        eax,esi 
00000017  mov         dword ptr [ebp-8],edx 
0000001a  mov         eax,0Ah 
0000001f  cdq 
00000020  idiv        eax,esi

Note the 2 calls to idiv. The first stores the remainder (EDX) into the remainder parameter on the stack. The 2nd is to determine the quotient (EAX). This 2nd call is not really needed, since EAX has the correct value after the first call to idiv.

Austin Salonen · Answer 7 · 2009-01-15T16:45:37.357

1

Does anyone else get the opposite when testing this?

Math.DivRem = 11.029 sec, 11.780 sec
MyDivRem = 27.330 sec, 27.562 sec
DivRem = 29.689 sec, 30.338 sec

FWIW, I'm running an Intel Core 2 Duo.

The numbers above were with a debug build...

With the release build:

Math.DivRem = 10.314
DivRem = 10.324
MyDivRem = 5.380

Looks like the "rem" IL command is less efficient than the "mul,sub" combo in MyDivRem.

edited Jan 15 '09 at 16:45

answered Jan 15 '09 at 16:01

Austin Salonen

49,173
15
109
139

@leppie -- release build data added but the numbers were accurate for the debug build. – Austin Salonen Jan 15 '09 at 16:49
MyDivRem consistently runs in the low- to mid-5's – Austin Salonen Jan 15 '09 at 16:52

Chris Ammerman · Answer 8 · 2009-01-15T16:57:02.527

1

The efficiency may very well depend on the numbers involved. You are testing a TINY fraction of the available problem space, and all front-loaded. You are checking the first 1 million * 10 = 1 billion contiguous input combinations, but the actual problem space is approx 4.2 billion squared, or 1.8e19 combinations.

The performance of general library math operations like this needs to be amortized over the whole problem space. I'd be interested to see the results of a more normalized input distribution.

edited Jan 15 '09 at 16:57

answered Jan 15 '09 at 16:50

Chris Ammerman

14,978
8
41
41

Plus, I'd say performing 1 billion runs in under 30 seconds is pretty good, so what's the fuss? – Michael Haren Jan 15 '09 at 17:52
1

I tested almost the entire space incrementing each variable by a large primes and Math.DivRem is still inefficient – ggf31416 Jan 15 '09 at 19:54

score 0 · Answer 9 · answered Mar 26 '10 at 11:52

I would guess that the majority of the added cost is in the set-up and tear-down of the static method call.

As for why it exists, I would guess that it does in part for completeness and in part for the benefit of other languages that may not have easy to use implementations of integer division and modulus computation.

score 0 · Answer 10 · edited Oct 23 '17 at 07:46

0

Here are my numbers:

15170 MyDivRem
29579 DivRem (same code as below)
29579 Math.DivRem
30031 inlined

The test was slightly changed; I added assignment to the return value and was running release build.

Core 2 Duo 2.4

Opinion:

You seemed to have found a nice optimization ;)

edited Oct 23 '17 at 07:46

Peter Mortensen

30,738
21
105
131

answered Jan 15 '09 at 16:25

leppie

115,091
17
196
297

score -3 · Answer 11 · edited Oct 23 '17 at 07:47

-3

It's partly in the nature of the beast. There is to the best of my knowledge no general quick way to calculate the remainder of a division. This is going to take a correspondingly large amount of clock cycles, even with x hundred million transistors.

edited Oct 23 '17 at 07:47

Peter Mortensen

30,738
21
105
131

answered Jan 15 '09 at 16:42

howlingmadhowie

63
2

If you have the result, getting the remainder is a mul-sub which is faster than another div. – Joshua Jan 17 '09 at 16:43
yeah, but you need to get that value first. i can't see any way of doing this in less than O(ln(n)) time. can you? – howlingmadhowie Jan 17 '09 at 17:08
This is indeed true. You can't get the remainder without a division unless you want the remainder of a power of two. – Thorham Jan 20 '21 at 14:35

Why is Math.DivRem so inefficient?

CIL

11 Answers11

Linked