Here's a simple example of the difference:
`i++` in C++ compiles down to (on x86-64 machines) a simple `inc REGISTER` instruction, which takes a fraction of a cycle to execute.
`i += 1` in Python can be disassembled with the `dis` module via `dis.dis('i += 1')`, which informs us that the bytecode involved is:
1 0 LOAD_NAME 0 (i)
2 LOAD_CONST 0 (1)
4 INPLACE_ADD
6 STORE_NAME 0 (i)
8 LOAD_CONST 1 (None)
10 RETURN_VALUE
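If you want to reproduce that dump yourself, two lines will do it (the exact opcodes and their numbering vary between CPython versions, so newer interpreters will print a somewhat different listing):

import dis

dis.dis('i += 1')  # compiles the statement in isolation and prints its bytecode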
Technically, all the instructions that end in `_NAME` become `_FAST` in a function (we disassembled an isolated statement, so it behaved slightly differently), and the `LOAD_CONST (None)`/`RETURN_VALUE` pair won't exist for the statement in a real function (the function has to do it once, but not for every statement), but close enough. In practice, the real bytecode within a function would be more like:
1 0 LOAD_FAST 0 (i)
2 LOAD_CONST 0 (1)
4 INPLACE_ADD
6 STORE_FAST 0 (i)
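You can see the `_FAST` variants for yourself by putting the statement inside a function before disassembling it, something like this (output will differ slightly across CPython versions):

import dis

def f(i):
    i += 1  # the statement we care about

dis.dis(f)  # shows the LOAD_FAST/STORE_FAST form, plus a single implicit return at the end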
Each of those instructions requires either a run through a `switch` statement or a computed goto (depending on how CPython was compiled), loading the next instruction and updating code position information (it also involves repeatedly checking to ensure no other thread is asking for the GIL). `LOAD_FAST` and `LOAD_CONST` instructions involve a C array lookup and a reference count adjustment (a single reference count adjustment alone is equivalent to the `i++` from before, except it has to change memory, not a register, so it's slower). `STORE_FAST` similarly involves a C array lookup, a reference count adjustment (to decrement the existing value), and often, freeing memory (if the decref removed the last reference to the value).
`INPLACE_ADD` has to dynamically look up and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python `int` to do the work (and if the numbers are big enough, this involves array-based math, which gets ugly), (usually) create a brand new Python `int` object, and also do more reference count adjustment.
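Both of those costs are visible from pure Python if you're curious. This is just an illustrative, CPython-specific sketch (note that `sys.getrefcount` reports one extra reference held by its own argument):

import sys

x = object()               # a fresh object with a single real reference
print(sys.getrefcount(x))  # typically 2: the name 'x' plus the call's temporary reference
y = x                      # binding another name is an incref...
print(sys.getrefcount(x))  # ...typically 3 now
del y                      # ...and unbinding is an immediate decref
print(sys.getrefcount(x))  # back to 2

i = 10
old = i
i += 1                     # INPLACE_ADD can't mutate an immutable int in place...
print(i is old)            # False: a brand new int object was produced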
Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.
I'll note that the relative cost here is highest for simple math operations; the more work a single bytecode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.
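As a quick, unscientific illustration of that point (my own example, not taken from the question): a Python-level loop pays the per-iteration interpreter cost on every pass, while handing the same work to built-ins that loop in C pays it roughly once.

import timeit

def manual_sum(n):
    # every iteration runs through the bytecode dispatch loop
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total

def builtin_sum(n):
    # sum() and range() do their looping in C, so the per-element
    # interpreter overhead largely disappears
    return sum(range(n))

print(timeit.timeit(lambda: manual_sum(100_000), number=100))
print(timeit.timeit(lambda: builtin_sum(100_000), number=100))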
As for speeding it up, the main rules are:
- Write Python code, not C code. You're manually maintaining your counters, when Python's `range` could do the job for you (and save a lot of individual bytecode instructions). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. `for` loops over `range` rather than `while` loops with manual counter adjustment).
- For mass math operations, use extension modules that can do the work in bulk, e.g. `numpy`. All that overhead for a single addition is bad; paying it just once for 1000 additions is pretty trivial. (A short sketch follows this list.)
- Try alternate interpreters (e.g. PyPy)
- Use Cython to compile C++ from your Python code (requires adding appropriate `cdef` declarations)
- Use `ctypes` to call existing C libraries (also sketched after this list), and/or write raw Python C extensions (when Cython can't handle what you want)
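To make the `numpy` bullet concrete, here's a minimal sketch (not tied to the original problem): one vectorized statement performs a million additions, and the interpreter overhead is paid once for the whole array rather than once per element.

import numpy as np

a = np.arange(1_000_000, dtype=np.int64)
a += 1  # one bytecode-level operation; the per-element work happens in compiled C

And a minimal sketch of the `ctypes` approach, assuming a Unix-like system where the C math library can be located (the lookup and signature declarations are standard `ctypes` usage, nothing specific to this problem):

import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # load the C math library (may need tweaking per platform)
libm.sqrt.restype = ctypes.c_double                # declare the C signature...
libm.sqrt.argtypes = (ctypes.c_double,)
print(libm.sqrt(2.0))                              # ...then call it like a Python function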
Aside from that, you just have to accept that interpreted languages with dynamic typing are always going to have overhead that a compiled, statically typed language won't have.
To address point #1, a Pythonic version of your code would look like:
def main():
sum = 1
for i in range(2, 100000):
for j in range(2, i):
if i%j == 0:
sum += 1
break
print(sum)
if __name__ == "__main__":
main()
You could even replace the inner loop with:
sum += any(i % j == 0 for j in range(2, i))
though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using `range`, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.
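Spelled out in full, that variant might look something like the following sketch (I've used `total` rather than `sum` here purely to avoid shadowing the built-in):

def main():
    total = 1
    for i in range(2, 100000):
        # any() returns True/False, which add as 1/0
        total += any(i % j == 0 for j in range(2, i))
    print(total)

if __name__ == "__main__":
    main()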
For a demonstration of the difference in bytecode complexity, consider a function that does nothing but run a loop, with either `while` and a manual counter, or `for` and `range`:
def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass
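The dumps that follow can be regenerated with the same `dis` module as before (again, newer CPython versions will show somewhat different opcodes than the listings here):

import dis

dis.dis(whileloop)
dis.dis(forloop)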
Disassembling each function shows:
3 0 LOAD_CONST 1 (0)
2 STORE_FAST 1 (i)
4 4 SETUP_LOOP 20 (to 26)
>> 6 LOAD_FAST 1 (i)
8 LOAD_FAST 0 (n)
10 COMPARE_OP 0 (<)
12 POP_JUMP_IF_FALSE 24
5 14 LOAD_FAST 1 (i)
16 LOAD_CONST 2 (1)
18 INPLACE_ADD
20 STORE_FAST 1 (i)
22 JUMP_ABSOLUTE 6
>> 24 POP_BLOCK
>> 26 LOAD_CONST 0 (None)
28 RETURN_VALUE
for `whileloop`, and:
8 0 SETUP_LOOP 16 (to 18)
2 LOAD_GLOBAL 0 (range)
4 LOAD_FAST 0 (n)
6 CALL_FUNCTION 1
8 GET_ITER
>> 10 FOR_ITER 4 (to 16)
12 STORE_FAST 1 (i)
9 14 JUMP_ABSOLUTE 10
>> 16 POP_BLOCK
>> 18 LOAD_CONST 0 (None)
20 RETURN_VALUE
for `forloop`. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the `while` version runs from the `LOAD_FAST` following the `SETUP_LOOP` to the `JUMP_ABSOLUTE`, encompassing nine instructions per pass; for the `for` version, it runs from the `FOR_ITER` to the `JUMP_ABSOLUTE`, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a `while` loop.
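If you'd rather see the difference than read the bytecode, a rough timing harness along these lines will do; the exact numbers depend on your machine and CPython version, but the `for`/`range` version should come out clearly ahead:

import timeit

n = 1_000_000
print("while + manual counter:", timeit.timeit(lambda: whileloop(n), number=10))
print("for + range:           ", timeit.timeit(lambda: forloop(n), number=10))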