
I remember from assembly that integer division instructions yield both the quotient and the remainder. So, in Python, will the built-in divmod() function perform better than using the % and // operators (assuming, of course, that one needs both the quotient and the remainder)?

q, r = divmod(n, d)
q, r = (n // d, n % d)
smichak

3 Answers


To measure is to know (all timings on a MacBook Pro, 2.8 GHz i7):

>>> import sys, timeit
>>> sys.version_info
sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
>>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7')
0.1473848819732666
>>> timeit.timeit('n // d, n % d', 'n, d = 42, 7')
0.10324406623840332

The divmod() function is at a disadvantage here because you need to look up the global each time. Binding it to a local (all variables in a timeit time trial are local) improves performance a little:

>>> timeit.timeit('dm(n, d)', 'n, d = 42, 7; dm = divmod')
0.13460898399353027

but the operators still win because they don't have to preserve the current frame while a function call to divmod() is executed:

>>> import dis
>>> dis.dis(compile('divmod(n, d)', '', 'exec'))
  1           0 LOAD_NAME                0 (divmod)
              3 LOAD_NAME                1 (n)
              6 LOAD_NAME                2 (d)
              9 CALL_FUNCTION            2
             12 POP_TOP             
             13 LOAD_CONST               0 (None)
             16 RETURN_VALUE        
>>> dis.dis(compile('(n // d, n % d)', '', 'exec'))
  1           0 LOAD_NAME                0 (n)
              3 LOAD_NAME                1 (d)
              6 BINARY_FLOOR_DIVIDE 
              7 LOAD_NAME                0 (n)
             10 LOAD_NAME                1 (d)
             13 BINARY_MODULO       
             14 BUILD_TUPLE              2
             17 POP_TOP             
             18 LOAD_CONST               0 (None)
             21 RETURN_VALUE        

The // and % variant uses more opcodes, but the CALL_FUNCTION bytecode is a bear, performance-wise.
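One way to confirm that the call overhead itself is the culprit is to route the same arithmetic through operator.floordiv and operator.mod (a sketch, not one of the original timings): the function-call cost comes back, so expect a result near the divmod() figure rather than the bare operators':

>>> import timeit
>>> # same arithmetic as 'n // d, n % d', but spelled as function calls;
>>> # the timing should land close to divmod()'s, not the operators'
>>> timeit.timeit('fd(n, d), m(n, d)',
...               'from operator import floordiv as fd, mod as m; n, d = 42, 7')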


In PyPy, for small integers there isn't really much of a difference; the small speed advantage the opcodes have melts away under the sheer speed of C integer arithmetic:

>>>> import platform, sys, timeit
>>>> platform.python_implementation(), sys.version_info
('PyPy', sys.version_info(major=2, minor=7, micro=10, releaselevel='final', serial=42))
>>>> timeit.timeit('divmod(n, d)', 'n, d = 42, 7', number=10**9)
0.5659301280975342
>>>> timeit.timeit('n // d, n % d', 'n, d = 42, 7', number=10**9)
0.5471200942993164

(I had to crank the number of repetitions up to 1 billion to show how small the difference really is; PyPy is blazingly fast here.)

However, when the numbers get large, divmod() wins by a country mile:

>>>> timeit.timeit('divmod(n, d)', 'n, d = 2**74207281 - 1, 26', number=100)
17.620037078857422
>>>> timeit.timeit('n // d, n % d', 'n, d = 2**74207281 - 1, 26', number=100)
34.44323515892029

(I now had to scale the number of repetitions down by a factor of 10 compared to hobbs' numbers, just to get a result in a reasonable amount of time.)

This is because PyPy can no longer unbox those integers as C integers; you can see the striking difference in timings between using sys.maxint and sys.maxint + 1:

>>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint, 26', number=10**7)
0.008622884750366211
>>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint, 26', number=10**7)
0.007693052291870117
>>>> timeit.timeit('divmod(n, d)', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
0.8396248817443848
>>>> timeit.timeit('n // d, n % d', 'import sys; n, d = sys.maxint + 1, 26', number=10**7)
1.0117690563201904
Martijn Pieters
  • Any idea on why `divmod` completely destroyed `//` and `%` with large numbers, though? – Dimitris Fasarakis Hilliard Dec 25 '16 at 13:35
  • @JimFasarakis-Hilliard: large number support. You can't use a basic C `&` masking op anymore on a C integer value, so the usual optimisations go out the window. You have to use an algorithm that's directly proportional to the size of the int, and doing it once with a divmod vs. doing it twice with the individual operators hurts badly there. – Martijn Pieters Dec 25 '16 at 14:03
  • @AnnZen are you using PyPy or is the code on the critical path and needs to perform as fast as possible? If not, pick whatever you feel communicates the intent better. – Martijn Pieters Feb 23 '21 at 07:21
  • @MartijnPieters The code doesn't need to be as fast as possible... but I actually need the coordinates reversed. Would `divmod(a, b)[::-1]` communicate the intent better than `a // b, a % b`? Thank you! – Red Feb 23 '21 at 13:10
  • @AnnZen I would say no. `(a // b, a % b)` communicates intent much better than `divmod(a, b)[::-1]`. I honestly wouldn't remember the order of the output of `divmod`. When assigning the output to two variables, the variable names usually make it clear. – nog642 Oct 05 '21 at 15:54

Martijn's answer is correct if you're using "small" native integers, where arithmetic operations are very fast compared to function calls. However, with bigints, it's a whole different story:

>>> import timeit
>>> timeit.timeit('divmod(n, d)', 'n, d = 2**74207281 - 1, 26', number=1000)
24.22666597366333
>>> timeit.timeit('n // d, n % d', 'n, d = 2**74207281 - 1, 26', number=1000)
49.517399072647095

When dividing a 22-million-digit number, divmod() is almost exactly twice as fast as doing the division and modulus separately, as you might expect.

On my machine, the crossover occurs somewhere around 2^63, but don't take my word for it. As Martijn says, measure! When performance really matters, don't assume that what held true in one place will still be true in another.
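To locate that crossover on your own hardware, here is a minimal sketch (the compare helper is hypothetical, not taken from either answer):

import timeit

def compare(bits, reps=10**5):
    # time both variants on an n of roughly `bits` bits
    setup = 'n, d = 2**%d - 1, 26' % bits
    t_divmod = timeit.timeit('divmod(n, d)', setup, number=reps)
    t_ops = timeit.timeit('n // d, n % d', setup, number=reps)
    return t_divmod, t_ops

for bits in (30, 62, 63, 64, 128, 1024):
    t_dm, t_ops = compare(bits)
    print('%5d bits: divmod %.4fs, // and %% %.4fs' % (bits, t_dm, t_ops))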

hobbs

I am iterating over an array pulse_onset, computing the quotient and remainder of each element by 256.

pulse_onset.shape
(14307,)

Comparing the two variants below, divmod is much faster. My guess is that iterating over the array twice is worse than using divmod, but I am not familiar enough with the internals.

import time
import numpy as np

t0 = time.time()
div_array = np.array([divmod(i, 256) for i in pulse_onset])
t1 = time.time()
total = t1 - t0
print(total)
0.008155584335327148
t0 = time.time()
# note: round() rounds to nearest, unlike the floor division divmod() performs
pulse_onset_int = np.array([round(i / 256) for i in pulse_onset])
remainder = np.array([i % 256 for i in pulse_onset])
t1 = time.time()
total = t1 - t0
print(total)
0.5383238792419434

Doing something like

np.array([(round(i / 256), i % 256) for i in pulse_onset])

shaves a little off the time (0.39537501335144043), which suggests that the bigger cost is doing the division twice rather than iterating over the array twice.
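That said, NumPy can do this without a Python-level loop at all. A vectorized sketch (assuming a NumPy version with np.divmod, added around 1.13; the pulse_onset below stands in for the real data):

import numpy as np

pulse_onset = np.arange(14307)  # stand-in for the real array
# one pass, no Python-level loop; floor-divides like divmod(),
# unlike the round(i / 256) variant above
quotient, remainder = np.divmod(pulse_onset, 256)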

Matias Andina