This is a Park-Miller pseudo-random number generator:
    def gen1(a=783):
        while True:
            a = (a * 48271) % 0x7fffffff
            yield a
The 783 is just an arbitrary seed. The 48271 is the multiplier recommended by Park and Miller in the original paper (PDF: Park, Stephen K.; Miller, Keith W. (1988). "Random Number Generators: Good Ones Are Hard To Find").
I would like to improve the performance of this LCG. The literature describes a way to avoid the division using bitwise tricks (source):
A prime modulus requires the computation of a double-width product and an explicit reduction step. If a modulus just less than a power of 2 is used (the Mersenne primes 2^31 − 1 and 2^61 − 1 are popular, as are 2^32 − 5 and 2^64 − 59), reduction modulo m = 2^e − d can be implemented more cheaply than a general double-width division using the identity 2^e ≡ d (mod m).
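To make the identity concrete (my own illustration, not from the quoted source): for m = 2**31 − 1 we have e = 31 and d = 1, so splitting a product at bit 31 and adding the high half back onto the low half preserves the residue:

```python
# Illustration of the identity 2**e ≡ d (mod m) for m = 2**e - d.
# With m = 2**31 - 1 (e = 31, d = 1), x = hi * 2**31 + lo implies
# x ≡ hi + lo (mod m), so the high bits just fold back onto the low bits.
m = 2**31 - 1
x = 123456789 * 48271           # a double-width product
hi, lo = x >> 31, x & m         # split at bit 31
assert (hi + lo) % m == x % m   # folding preserves the residue
```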
Noting that the modulus 0x7fffffff is actually the Mersenne prime 2**31 - 1, here is the idea implemented in Python:
    def gen2(a=783):
        while True:
            a *= 48271
            a = (a & 0x7fffffff) + (a >> 31)
            a = (a & 0x7fffffff) + (a >> 31)
            yield a
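Before benchmarking, it seemed worth confirming that the two variants produce the same stream (the definitions are repeated here so the snippet runs on its own):

```python
from itertools import islice

def gen1(a=783):  # reference implementation, repeated for a standalone check
    while True:
        a = (a * 48271) % 0x7fffffff
        yield a

def gen2(a=783):  # folded-reduction variant, repeated for a standalone check
    while True:
        a *= 48271
        a = (a & 0x7fffffff) + (a >> 31)
        a = (a & 0x7fffffff) + (a >> 31)
        yield a

# The two generators should agree term by term: the double fold
# reduces the product fully modulo the Mersenne prime 2**31 - 1.
assert list(islice(gen1(), 10_000)) == list(islice(gen2(), 10_000))
```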
Basic benchmark script:
    import time, sys

    g1 = gen1()
    g2 = gen2()

    for g in g1, g2:
        t0 = time.perf_counter()
        for i in range(int(sys.argv[1])):
            next(g)
        print(g.__name__, time.perf_counter() - t0)
The performance is improved under PyPy (7.3.0 @ 3.6.9), for example when generating 100M terms:
$ pypy lcg.py 100000000
gen1 0.4366550260456279
gen2 0.3180829349439591
Unfortunately, the performance is actually degraded in CPython (3.9.0 / Linux):
$ python3 lcg.py 100000000
gen1 20.650125587941147
gen2 26.844335232977755
My questions:
- Why is the bitwise arithmetic, usually touted as an optimization, actually slower than a single modulo operation in CPython?
- Can you improve the performance of this PRNG under CPython some other way, perhaps using numpy or ctypes?
Note that arbitrary-precision integers are not strictly required here, because this generator will never yield numbers with a bit length greater than:
>>> 0x7fffffff.bit_length()
31