
This is a simple linear congruential generator in Python:

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n

g = prng(123)
for i in range(10**8):
    next(g)

assert next(g) == 1062172093

Python 2.7 is faster than any version of Python 3.x here. Producing 100M terms on Linux:

python2.7 g.py   14.60s user 0.65s system 99% cpu 15.260 total
python3.6 g.py   18.70s user 0.00s system 99% cpu 18.711 total
python3.7 g.py   18.10s user 0.04s system 99% cpu 18.193 total
python3.8 g.py   19.22s user 0.02s system 99% cpu 19.273 total
python3.9 g.py   18.90s user 0.00s system 99% cpu 18.924 total
python3.10 g.py  18.29s user 0.00s system 99% cpu 18.311 total
python3.11 g.py  16.93s user 0.00s system 99% cpu 16.946 total

Why are CPython 3.x interpreters so much slower at executing this code?

wim
  • Just to be clear, you're trying to consume 1e8 numbers, then print one, right? – Mad Physicist May 25 '21 at 20:33
  • 1
    @MadPhysicist I think he's just doing lots of repetitions for the benchmark. – Barmar May 25 '21 at 20:34
  • 2
    I presume these are both x64? – Sayse May 25 '21 at 20:34
  • Using a numpy (u)int32 instance would probably do it – Mad Physicist May 25 '21 at 20:35
  • @Barmar. Ah. So we're assuming int vs long is the problem here, not the new generator API – Mad Physicist May 25 '21 at 20:37
  • 1
    @MadPhysicist That's the question: What's the source of the inefficiency here? – Barmar May 25 '21 at 20:38
  • To the OP: Do you see the same problem if you use an ordinary loop rather than a generator? – Barmar May 25 '21 at 20:39
  • Also, what if you consume it with itertools.islice fed into a deque(..., maxlen=0)? – Mad Physicist May 25 '21 at 20:41
  • 2
    I tried with just a simple loop and get 21 s with Python 2 vs 44 s with Python 3, which confirms wim's timings. – Thierry Lathuille May 25 '21 at 20:46
  • 1
    Consuming with `deque(islice(g, 10**8), maxlen=0)`, the Python 3 code is now *460% slower* :-\ – wim May 25 '21 at 20:49
  • You can gain 20-30% in performance if you use a loop instead of a generator – Hoxha Alban May 25 '21 at 20:55
  • 1
    @HoxhaAlban To filter out the looping mechanism, compare this with a generator that just does `yield 1` – Barmar May 25 '21 at 20:58
  • @wim. Strange. On my machine it takes half the time – Mad Physicist May 25 '21 at 21:30
  • @MadPhysicist To clarify: 460% slower than the same code on Python 2 (i.e. the time spent in generator consumption is reduced, so the inefficiencies of the arithmetic are magnified in comparison). – wim May 25 '21 at 21:32
  • [My results are different from OP's](https://imgur.com/a/IDU30JP), at least on Python 3.7.3 vs 2.7.16 – Yoshikage Kira May 25 '21 at 21:33
  • @wim. I've added the deque version to my answer to clarify what I mean – Mad Physicist May 25 '21 at 21:37
  • @Goion. What are the three numbers? – Mad Physicist May 25 '21 at 21:39
  • @MadPhysicist Seconds. I used timeit module to time the code. – Yoshikage Kira May 25 '21 at 21:43
  • @wim. And now I've updated with `float` instead of numpy. It does a much better job being a fixed-width type than `np.uint64` apparently. – Mad Physicist May 25 '21 at 22:05
  • @Goion. That doesn't really answer the question. – Mad Physicist May 25 '21 at 22:06
  • 1
    @MadPhysicist Yeah. It is a comment. Did I post it as an answer? No. OP assumed that the CPython 3.x interpreter is slower than 2.x, which is not true. Sure, you made a bunch of optimizations to speed up the Python 3 code. But what if you made the same optimizations in the Python 2 code? Would Python 2 be faster or not? It is hard to say with certainty why py3 is slower than py2 in OP's case, as there could be important information we are overlooking. – Yoshikage Kira May 25 '21 at 22:23
  • @Goion. I think we have a couple of misunderstandings. I wasn't accusing you of not answering OP's question, just my comment. I'm still not sure what the actual numbers are. I totally agree that I'm not explaining much, except that the part where float is better than int is very unambiguous. Technically though, I'm pretty close to being able to answer the literal question in the title, even if that's not what wim is really asking – Mad Physicist May 25 '21 at 22:34
  • It would be helpful if OP answered more specific questions. I ran some more tests to [verify my result](https://imgur.com/a/4x0X94z). [My code](https://pastebin.com/LFPcjyE0) @ThierryLathuille What machine did you test this on? – Yoshikage Kira May 25 '21 at 23:14

1 Answer


I don't have a py2 to play with, so the following benchmarks just compare different implementation details in py3. All the benchmarks were done in IPython 7.22.0 running a Python 3.8.8 kernel, timed with `time.process_time`. I took the middle of three runs for each variant. Results are meaningful to about 1 s, or ~3% accuracy.

With the original code, the loop takes 35.36 s.
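For reference, the instrumented benchmark looks like this; it is just the question's code wrapped in `process_time` calls, and the same pattern (with the same import) is reused in every snippet below:

from time import process_time

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n

g = prng(123)
p = process_time()
for i in range(10**8):
    next(g)
q = process_time()
print(q - p, ':', next(g))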

You can make all the numbers into the appropriate fixed-width numpy types. That way, you avoid py3's arbitrary-precision int arithmetic entirely (py2's int was a fixed-width machine integer, overflowing into long only when necessary, which is a large part of why it is faster here):

import numpy as np

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    a = np.uint64(48271)
    b = np.uint64(0x7fffffff)
    n = np.uint64(n)
    while True:
        n = n * a % b
        yield n

g = prng(123)
p = process_time()
for i in range(10**8):
    next(g)
q = process_time()
print(q - p, ':', next(g))

The runtime is reduced to 28.05s: a drop of ~21%. BTW, making a and b globals instead of locals drops the time by only ~5%, to 33.55s (global name lookups are slower than local ones in CPython).
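For reference, that global variant would look something like this (a sketch; only the placement of a and b changes):

import numpy as np

# module-level globals instead of locals inside the generator
a = np.uint64(48271)
b = np.uint64(0x7fffffff)

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    n = np.uint64(n)
    while True:
        n = n * a % b  # a and b now resolve via slower global lookups
        yield n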

As @Andrej Kesely suggested, a better way to simulate the fixed-width ints of py2 is to use float in py3, rather than going through numpy's dispatching machinery on every operation:

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n

g = prng(123.0)
p = process_time()
for i in range(10**8):
    next(g)
q = process_time()
print(q - p, ':', next(g))

And in fact, we see a runtime of 23.63s, which represents a 33% reduction from the original.
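As discussed in the comments below, the float arithmetic here stays exact: the largest possible intermediate product is a 47-bit integer, which fits comfortably in a double's 53-bit mantissa. A quick sanity check:

# largest value before the modulo: (0x7fffffff - 1) * 48271
assert (0x7fffffff - 1) * 48271 == 103661183076066  # a 47-bit number
assert 103661183076066 < 2**53  # exactly representable as an IEEE double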

To bypass the generator API, let's rewrite the loop without the generator:

n = 123
p = process_time()
for i in range(10**8):
    n = n * 48271 % 0x7fffffff
q = process_time()
print(q - p, n * 48271 % 0x7fffffff)

This runtime is only 26.28s, which is an improvement of ~26%.

Doing the same thing, but with a function call will only save you ~3% (runtime of 34.33s):

def prng(n):
    return n * 48271 % 0x7fffffff

n = 123
p = process_time()
for i in range(10**8):
    n = prng(n)
q = process_time()
print(q - p, prng(n))

Using float speeds up the function version as much as it did the generator:

def prng(n):
    return n * 48271.0 % 2147483647.0

n = 123.0
p = process_time()
for i in range(10**8):
    n = prng(n)
q = process_time()
print(q - p, prng(n))

A runtime of 22.97s is a drop of an additional 33%, just like we saw with the generator.

Running the loop-only solution using float also helps a lot:

n = 123.0
p = process_time()
for i in range(10**8):
    n = n * 48271.0 % 2147483647.0
q = process_time()
print(q - p, n * 48271.0 % 2147483647.0)

The runtime is 12.72s, which is a 64% drop from the original and a 52% drop from the int loop version.

Clearly the datatype is a significant source of slowness here, but it is also quite likely that python 3's generator machinery adds another 20% or so to the runtime. Removing both sources of slowness gets us to well under half the runtime of the original code.

It is not completely clear how much of the remainder, once infinite-precision ints are removed, is caused by the generator vs the for-loop machinery. So let's get rid of the for loop to see what happens:

from itertools import islice
from collections import deque

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271 % 0x7fffffff
        yield n

g = prng(123)
p = process_time()
deque(islice(g, 10**8), maxlen=0)
q = process_time()
print(q - p, ':', next(g))

The runtime is 21.32s, which is 40% faster than the original code, suggesting that the for-loop machinery may have become more robust, and therefore more cumbersome, in py3 as well.

It gets even better with float in prng (exactly as in the float example above). Now the runtime is 10.09s, which is a drop of 71%, or ~3x faster than the original code.
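Spelled out, that fastest variant combines the float generator with the deque/islice consumer:

from itertools import islice
from collections import deque
from time import process_time

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n

g = prng(123.0)
p = process_time()
deque(islice(g, 10**8), maxlen=0)  # consume 1e8 values without a Python-level loop
q = process_time()
print(q - p, ':', next(g))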

One more testable difference, suggested by @chepner, is that py2's range(10**8) is equivalent to py3's list(range(10**8)). This matters for the same reason that generators appear to be slower in py3.

def prng(n):
    # https://en.wikipedia.org/wiki/Lehmer_random_number_generator
    while True:
        n = n * 48271.0 % 2147483647.0
        yield n

g = prng(123.0)
r = list(range(10**8))
p = process_time()
for i in r:
    next(g)
q = process_time()
print(q - p, ':', next(g))

This version takes 20.62s, which is about 13% faster than the same code iterating over a lazy range object, and 42% faster than the original code. So clearly, the iteration machinery is a significant factor here as well.
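Summarizing the timings above (Python 3.8.8, middle of three `process_time` runs):

variant                                       runtime
int generator, for loop (original)            35.36s
numpy uint64 generator, for loop              28.05s
numpy uint64 generator (globals), for loop    33.55s
float generator, for loop                     23.63s
int, inlined loop (no generator)              26.28s
int, function call in a for loop              34.33s
float, function call in a for loop            22.97s
float, inlined loop                           12.72s
int generator, deque/islice                   21.32s
float generator, deque/islice                 10.09s
float generator, for loop over list(range)    20.62s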

wim
Mad Physicist
  • 1
    You can compute with floats: `n = n * 48271.0 % 2147483647.0`, which yields `11sec` vs `14sec` on my computer (but I'm not sure if this is correct) – Andrej Kesely May 25 '21 at 21:40
  • 1
    @AndrejKesely. I'm seeing a drop from 15.00 to 10.06, so I'd say it looks legit. I'll add that in too. – Mad Physicist May 25 '21 at 21:43
  • That's a significant drop :) – Andrej Kesely May 25 '21 at 21:43
  • 1
    @AndrejKesely. I went ahead and redid all the timings with `float`. Much better than numpy. Thanks for the tip. – Mad Physicist May 25 '21 at 22:04
  • 1
    `range` generates an in-memory list in Python 2, but a constant-size iterable in Python 3; that could be part of it. – chepner May 25 '21 at 22:11
  • @chepner. Excellent point. I'll look into that next. – Mad Physicist May 25 '21 at 22:34
  • @AndrejKesely Using float is very interesting. I suppose that bypasses all of Python's `int` complexity and gets us closer to the metal (Python 3 is still slower, but "only" ~25% slower now). Would accuracy still be guaranteed, though? The product could get as large as 103661183076066 before the modulo. – wim May 25 '21 at 23:19
  • @wim Yes, that's why I said *I'm not sure if it's correct* - I was thinking about the accuracy (CPython internally uses a 64-bit `double` for floats). That's a question for someone more experienced with IEEE floats, but `float('103661183076066')` yields the correct result. – Andrej Kesely May 25 '21 at 23:25
  • 1
    @wim I've found one answer here: https://stackoverflow.com/a/759223/10035985 `103661183076066` has 47 bits - so according to the answers there, it can be represented exactly with Python floats. – Andrej Kesely May 25 '21 at 23:37
  • @chepner. Added now. Definitely looks like using a list vs a generator shaves off another 10-15% of the runtime. – Mad Physicist May 26 '21 at 04:14
  • The problem with this answer is that these modifications also affect the Python 2 code. But we cannot see the *relative* difference they make (if any) between Python 2 and Python 3; the absolute timing difference is not useful. I tried a few of the tricks shown here and Python 2 remains significantly faster (~50%), so nothing here has gotten to the bottom of the issue: *why is a CPython 3.x interpreter slower at executing this code?* – wim Jun 03 '21 at 18:41
  • @wim. In a very handwavy way, because it uses infinite precision integers and the iteration machinery became that much heavier. I wonder if you have access to the compiler flags with which your different versions were built? – Mad Physicist Jun 03 '21 at 20:27
  • That's my guess too, however Python 2 also uses infinite-precision integers in practice (it automatically switches to the long type when necessary) - there should hopefully be a way to coerce Python 3 to operate with similar fixed-width efficiency when you know it's safe to do so (and here we know that values are bounded above by `0x7fffffff`). I don't know how to see the compiler flags from an existing installation, but they are probably findable in the homebrew/cask recipes (?) – wim Jun 03 '21 at 20:28