
I am currently comparing two loop calculations in Python 3 and C. For Python, I have:

# Python3
import time

t1 = time.process_time()
a = 100234555
b = 22333335
c = 341500
for i in range(1, 10000000001):
    a = a - (b % 2)
    b = b - (c % 2)
print("Sum is", a+b)
t2 = time.process_time()
print(t2-t1, "Seconds")

Then in C, I do the same thing:

#include <stdio.h>

int main() {
   long long a = 100234555;
   long long b = 22333335;  
   long long c = 341500;
   for(long long i = 1; i <= 10000000000; i++){
        a = a - (b % 2);
        b = b - (c % 2);
   }
   printf("Sum is %lld\n", a+b);
   return 0;
}

I timed both programs. The Python version takes around 3500 seconds, while the C version (including compilation and execution) takes only around 0.3 seconds.

I am wondering why there is such a big difference in timing. The code was run on a server with 100 GB of RAM and plenty of processing power.

user321627

3 Answers

19

It's partly because Python bytecode is executed by a program rather than by the CPU directly, but most of the overhead comes from the memory allocation and deallocation forced by the immutability of Python integers, which is a consequence of the object model, not of interpretation itself.

What's going on is that your C code can change the numbers' values in place, but in Python numbers are immutable, meaning they never change. So every time you do an arithmetic operation, Python has to create a new int object for the new value and later destroy the old int once it is no longer referenced. This is much slower than simply modifying a single memory location.
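You can observe this directly in CPython (a minimal sketch, not part of the original answer): `id()` shows that subtraction produces a brand-new object instead of updating the old one.

```python
# CPython sketch: ints are immutable, so arithmetic allocates a new object
a = 100234555          # value from the question (too large for the small-int cache)
b = a - 1              # a fresh int object is created to hold the result
print(id(a) != id(b))  # True: two distinct objects, not an in-place update
```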


There is also the possibility that your C compiler is being clever: through a chain of optimisations it can reason that the loop can be removed entirely, with the result identical to what it would be if the loop had actually run. I'd expect the code to run much faster than it did if that had been the case in your example, but the compiler is allowed to do it.
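To see why the loop is removable, here is the hand-derivation that constant propagation could perform (a sketch in Python, mirroring the C semantics): `c % 2` is 0, so `b` never changes, and `b % 2` is 1, so each iteration just subtracts 1 from `a`.

```python
a, b, c = 100234555, 22333335, 341500
n = 10_000_000_000                 # iterations of the original loop
# c % 2 == 0  ->  b is never modified inside the loop
# b % 2 == 1  ->  a loses exactly 1 per iteration
result = (a - n * (b % 2)) + b
print("Sum is", result)            # Sum is -9877432110
```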

Python has no such smart compiler. It can't do something as grand and clever as that; it's just not designed to optimise the code, because doing so reliably is so hard in a dynamically-typed language (though the fact that Python is strongly typed does make it somewhat of a possibility).

wizzwizz4
  • Which implies it is only an implementation detail. – 0___________ Sep 29 '18 at 16:23
  • Is this for real? Using object types instead of pure values for integers and having no mechanism to optimize the objects out when they're useless?? Why??? – R.. GitHub STOP HELPING ICE Sep 29 '18 at 16:30
  • @R.. Python isn't designed for speed. If you want mutable integers in Python you can add that as an extension, but it'd be confusing for most programmers if it worked like that by default due to Python's reference-by-default model. – wizzwizz4 Sep 29 '18 at 16:32
  • It's not a matter of "wanting mutable integers". Nobody wants that. It's a matter of wanting pure values that have no object or object lifetime associated with them, since that's a **huge** layer of inefficiency with no semantic difference. – R.. GitHub STOP HELPING ICE Sep 29 '18 at 19:06
  • @R.. Well, that'd have to be special-cased. It's theoretically possible... ish. I mean, you'd have to forgo BigInt support, and unless you were reimplementing Python the easiest way to do it would be to add a C extension with attributes storing numbers. But there's not really a way of doing this generally without wrecking lots of the interoperability of the language. Python just isn't what you want if you're looking for speed. – wizzwizz4 Sep 29 '18 at 19:16
  • Right. But it's such a gigantic, important case that it should be special-cased. – R.. GitHub STOP HELPING ICE Sep 29 '18 at 20:22
  • @R.. It complicates _so much_ implementation and will almost certainly result in a leaky abstraction. You can try it, but I doubt you'd succeed. – wizzwizz4 Sep 29 '18 at 20:28
  • Obviously an optimization that leaks (has observably different behavior, aside from performance or memory usage) is buggy and should not be done. But it's certainly possible to do in non-leaky ways. – R.. GitHub STOP HELPING ICE Sep 29 '18 at 20:31
  • @R.. I'd like to see that done. If you fork Python on GitHub and try to implement that, I'll help. – wizzwizz4 Sep 29 '18 at 20:36
  • Almost two years later, I'd like to take back the "on GitHub" bit. – wizzwizz4 Jul 08 '20 at 06:28
10

As dmuir noticed, the code can be simplified drastically if the compiler propagates some constants correctly. For example, `clang -O1` compiles the C code down to this (cf. https://gcc.godbolt.org/z/1ZH8Rm):

main:                                   # @main
        push    rax
        movabs  rsi, -9877432110
        mov     edi, offset .L.str
        xor     eax, eax
        call    printf
        xor     eax, eax
        pop     rcx
        ret
.L.str:
        .asciz  "Sum is %lld\n"

gcc -O1 produces essentially the same code.

Since this boils down to a single call to printf, the explanation seems to be:

  • The Python compiler is not as smart as the C compiler at optimizing this code.
  • Your C compiler takes a long time to compile this 12-line piece of code. 3 seconds is way too long given your hardware setup! It only takes 0.15 seconds on my dinky laptop to compile and run the code with all optimisations. Are you compiling as C++?

Testing the C version with optimisations disabled (-O0) produces this output:

$ time (clang -O0 -o loop10g loop10g.c && ./loop10g)
Sum is -9877432110

real    4m15.352s
user    3m47.232s
sys     0m3.252s

Unoptimized C is still much faster than Python: 255 seconds vs. more than 3500.

The Python code is interpreted as bytecode on a stack machine with dynamically typed values: a slowdown by a factor of 10 to 20 is typical. Furthermore, Python's integer arithmetic automatically switches to bignum mode for large values, which may be the case here, although the penalty for that should be even higher.
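To make the per-iteration overhead concrete, here is a small sketch using the standard `dis` module: a single statement from the loop body compiles to several bytecode instructions, each dispatched by the interpreter with full dynamic type checks.

```python
import dis

def body(a, b):
    # one statement from the loop in the question
    a = a - (b % 2)
    return a

# Each printed opcode is one trip through the interpreter's dispatch loop;
# the binary-arithmetic ops also allocate a fresh int object for their result.
for instr in dis.Bytecode(body):
    print(instr.opname)
```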

chqrlie
  • Is that cheating? ;) – Stargateur Sep 29 '18 at 21:51
  • @Stargateur: not really, the test is stupid. Instead, try to compute `fib(50)` with the classic recursive definition. Recognising the Fibonacci function and converting the recursion to iterative code would be cheating IMHO. – chqrlie Sep 29 '18 at 21:56
  • Figuring out what the programmer is trying to do and doing it faster is not "cheating". It's optimisation. Even if it's used on dumb Fibonacci code. – rici Sep 30 '18 at 03:15
  • By the way, CPython always uses a bignum representation, afaik. Bignum legs are 30 bits (on machines whose ALU is at least 64 bits), and operations on single-leg numbers are somewhat optimised. Also, there is a cache of small integers to avoid excess allocation, but I think it only goes up to 256. – rici Sep 30 '18 at 03:22
  • @rici `-5` to `256`, iirc. – wizzwizz4 Dec 17 '18 at 19:24
-2

The answer is very simple. Python is an interpreted language. All the instructions are executed by the interpreter (a special program that executes the script). That is much slower than C code, which is compiled to native machine code.

0___________
  • This is not the answer. It's part of it, but most of the overhead is caused by the memory allocation and deallocation caused by the immutability of integers, which is due to the object model, not the interpretedness. – wizzwizz4 Sep 29 '18 at 16:18
  • @wizzwizz4 It is the answer. – 0___________ Sep 29 '18 at 16:19
  • I guess there is also a good amount of compiler optimization involved. – Klaus D. Sep 29 '18 at 16:20
  • @KlausD. Optimisation is part of the compilation. Interpreted languages will always be much slower than compiled ones. – 0___________ Sep 29 '18 at 16:24
  • Running the **same** simple computation instructions on Python is slower than in C, but not to the extent observed. The factor of 1000 can only be explained by the compiler optimizing out most of the iterations. – Klaus D. Sep 29 '18 at 16:30
  • The OP didn't say what optimisation was used, but it seems to me a smart optimiser could remove the loop entirely: since c%2 is 0, b never changes, and since b%2 is 1, the effect of the loop is to subtract the loop count from a. – dmuir Sep 29 '18 at 16:52
  • Nothing stops an interpreter from doing static analysis and using it to optimise. And many do. – rici Sep 30 '18 at 03:17