Python

def main():
    i = 2
    sum = 1
    while i < 100000:
        j = 2
        while j < i:
            if i%j == 0:
                sum += 1
                break
            j += 1
        i += 1

    print(sum)


if __name__ == "__main__":
    main()

C++

#include <iostream>

using namespace std;

int main() {
    int sum = 1;
    for (int i=2; i<100000; i++) {
        for (int j=2; j<i; j++) {
            if (i%j == 0) {
                sum++;
                break;
            }
        }
    }
    cout << sum << endl;
    return 0;
}

C++

Run with: g++ -std=c++11 x.cpp -o x && time ./x

Time: ./x 1.36s user 0.00s system 99% cpu 1.376 total

Python

Run with: python x.py

Time: python x.py 32.10s user 0.21s system 98% cpu 32.854 total

Can anyone explain the huge difference between the time taken by the two programs? And what can be done to speed up the Python one?

Arpit Singla
  • One is compiled, the other is interpreted... – Jarod42 Jul 15 '19 at 17:42
  • Why do you use a `for` loop in one and a `while` in the other? – Error - Syntactical Remorse Jul 15 '19 at 17:43
  • The biggest difference is due to the fact that Python is an [interpreted language](https://en.wikipedia.org/wiki/Interpreted_language), whereas C++ is a [compiled language](https://en.wikipedia.org/wiki/Compiled_language). You can get some of the benefits of a compiled language with Python by using something like [Cython](https://cython.org/). – 0x5453 Jul 15 '19 at 17:43
  • Python uses a virtual machine, to execute the operations, while `C++` is translated to machine language which is directly executed by the CPU. – Daniel Jul 15 '19 at 17:45
  • This looks like an honest attempt at looking at Python performance: https://hackernoon.com/why-is-python-so-slow-e5074b6fe55b Several theories are analyzed. – SergeyA Jul 15 '19 at 17:52
  • Possible duplicate of [Why are Python Programs often slower than the Equivalent Program Written in C or C++?](https://stackoverflow.com/questions/3033329/why-are-python-programs-often-slower-than-the-equivalent-program-written-in-c-or) – mkrieger1 Jul 15 '19 at 18:20
  • Hopefully you're not including the output functions in your timings. Secondly, you should time optimized C++ builds, i.e. `-O2` or `-O3` on the compiler command line. – PaulMcKenzie Jul 15 '19 at 18:21
  • @PaulMcKenzie: Given the output functions are invoked once after a fairly expensive set of output-free code, it shouldn't matter much (even if it costs a solid 10 ms, that's still rounding error in the total runtime). The lack of optimization flags might explain why the speed difference was *only* ~24x, despite the benchmark doing nothing but simple math, which is one of CPython's worst cases for overhead:work ratio, where I'd expect worse performance relative to C/C++. – ShadowRanger Jul 15 '19 at 18:39

3 Answers


Here's a simple example of the difference:

i++ in C++ compiles down to (on x86-64 machines) a simple inc REGISTER instruction. Takes a fraction of a cycle to execute.

i += 1 in Python can be disassembled with the dis module via dis.dis('i += 1') which informs us that the bytecode involved is:

  1           0 LOAD_NAME                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_NAME               0 (i)
              8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

Try it online!
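For reference, the listing above can be reproduced locally with nothing more than the following (exact opcodes and offsets vary between CPython versions):

import dis

# Disassembles the statement compiled in isolation (module-level code), which is
# why the _NAME opcodes and the trailing LOAD_CONST/RETURN_VALUE show up.
dis.dis('i += 1')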

Technically, all the instructions that end in _NAME become _FAST in a function (we disassembled an isolated statement, so it behaved slightly differently), and the LOAD_CONST (None)/RETURN_VALUE pair won't exist for the expression in a real function (the function has to do it, but not for every expression), but close enough. In practice, the real bytecode within a function would be more like:

  1           0 LOAD_FAST                0 (i)
              2 LOAD_CONST               0 (1)
              4 INPLACE_ADD
              6 STORE_FAST               0 (i)

Each of those instructions requires either a run through a switch statement or a computed goto (depending on how CPython was compiled), loading the next instruction and updating code position information (it also involves repeatedly checking to ensure no other thread is asking for the GIL). LOAD_FAST and LOAD_CONST instructions involve a C array lookup and reference count adjustment (a single reference count adjustment alone is equivalent to the i++ from before, except it has to change memory, not a register, so it's slower). STORE_FAST similarly involves a C array lookup, reference count adjustment (to decrement the existing value), and often, freeing memory (if the decref removed the last reference to the value). INPLACE_ADD has to dynamically look up and call a function pointer to perform the addition (and it does so through a few layers of function indirection in the first place), which itself has to extract the underlying C value of each Python int to do the work (and if the numbers are big enough, this involves array-based math, which gets ugly), (usually) create a brand new Python int object, and also do more reference count adjustment.
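As a small illustration of how heavyweight even a plain Python int is compared to a C int (just an illustration; the exact numbers depend on the CPython build):

import sys

# Even the int 1 is a full heap object carrying a type pointer and a reference
# count; on 64-bit CPython this typically reports 28 bytes, versus 4 for a C int.
print(sys.getsizeof(1))

# Small ints are cached and shared across the interpreter, so their reference
# counts are large and constantly being adjusted by unrelated code.
print(sys.getrefcount(1))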

Basically, to get the equivalent of what C/C++ does in a single, cheap assembly instruction against a register, Python had to perform (estimating) half a dozen function calls (including one through a function pointer), dozens of memory lookups, a dozen or so reference count adjustments, etc. Frankly, the most surprising thing is that Python only takes ~24x longer than the C++.

I'll note that the relative cost here is highest for simple math operations; the more work a single bytecode does, the less the interpreter overhead matters. Unfortunately for this case, your code is nothing but simple math, so Python (at least, CPython) is at its worst here.

As for speeding it up, the main rules are:

  1. Write Python code, not C code. You're manually maintaining your counters, when Python's range could do the job for you (and save a lot of individual bytecode instructions). As I mentioned, it's the simplest, cheapest operations where the interpreter overhead is highest, but those operations are normally stuff you don't actually need to do very much, because there is usually a better way to do them (e.g. for loops over range rather than while loops with manual counter adjustment).
  2. For mass math operations, use extension modules that can do the work in bulk, e.g. numpy. All that overhead for a single addition is bad; paying it just once for 1000 additions is pretty trivial.
  3. Try alternate interpreters (e.g. PyPy)
  4. Use Cython to compile your Python code down to a C/C++ extension module (getting the full benefit requires adding appropriate cdef declarations)
  5. Use ctypes to call existing C libraries, and/or write raw Python C extensions (when Cython can't handle what you want)

Aside from that, you just have to accept that interpreted languages with dynamic typing are always going to have overhead that a compiled, statically typed language won't have.


To address point #1, a Pythonic version of your code would look like:

def main():
    sum = 1
    for i in range(2, 100000):
        for j in range(2, i):
            if i%j == 0:
                sum += 1
                break

    print(sum)

if __name__ == "__main__":
    main()

You could even replace the inner loop with:

    sum += any(i % j == 0 for j in range(2, i))

though that's unlikely to yield any performance benefits, just a bit of code simplification. The performance benefits come from using range, which bundles all the basic math of incrementing and testing into a single dedicated function, reducing the overhead significantly.
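To illustrate point #2, here's a rough sketch of what pushing the bulk of the work into numpy could look like for this particular problem (assuming numpy is installed; note that it also switches to a sieve, so part of the win is algorithmic rather than purely from the bulk array operations):

import numpy as np

def count_non_primes(n=100000):
    # is_composite[k] ends up True for every k in [2, n) that has a divisor
    # other than 1 and itself.
    is_composite = np.zeros(n, dtype=bool)
    for m in range(2, int(n ** 0.5) + 1):
        if not is_composite[m]:
            # One bulk slice assignment marks every multiple of m at C speed,
            # instead of paying interpreter overhead once per multiple.
            is_composite[m*m::m] = True
    # Match the original program's count: composites in [2, n), plus its initial 1.
    return int(is_composite[2:].sum()) + 1

print(count_non_primes())  # 90407, same as the original programs

The key line is the slice assignment: a single bytecode-level operation marks thousands of entries, so the per-operation interpreter overhead is paid once per prime rather than once per multiple.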

For demonstration of the difference in bytecode complexity, consider a pair of functions that do nothing but run a loop, one with while and a manual counter, the other with for and range:

def whileloop(n):
    i = 0
    while i < n:
        i += 1

def forloop(n):
    for i in range(n):
        pass
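
Both can be fed to the dis module to dump their bytecode:

import dis

dis.dis(whileloop)  # produces the first listing below
dis.dis(forloop)    # produces the second listing below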

Doing so shows:

  3           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (i)

  4           4 SETUP_LOOP              20 (to 26)
        >>    6 LOAD_FAST                1 (i)
              8 LOAD_FAST                0 (n)
             10 COMPARE_OP               0 (<)
             12 POP_JUMP_IF_FALSE       24

  5          14 LOAD_FAST                1 (i)
             16 LOAD_CONST               2 (1)
             18 INPLACE_ADD
             20 STORE_FAST               1 (i)
             22 JUMP_ABSOLUTE            6
        >>   24 POP_BLOCK
        >>   26 LOAD_CONST               0 (None)
             28 RETURN_VALUE

for whileloop and:

  8           0 SETUP_LOOP              16 (to 18)
              2 LOAD_GLOBAL              0 (range)
              4 LOAD_FAST                0 (n)
              6 CALL_FUNCTION            1
              8 GET_ITER
        >>   10 FOR_ITER                 4 (to 16)
             12 STORE_FAST               1 (i)

  9          14 JUMP_ABSOLUTE           10
        >>   16 POP_BLOCK
        >>   18 LOAD_CONST               0 (None)
             20 RETURN_VALUE

Try it online!

for forloop. The body of the loop (the stuff executed once per pass, including testing the termination condition) for the while runs from the LOAD_FAST following the SETUP_LOOP to the JUMP_ABSOLUTE, encompassing nine instructions per loop; for the for, it runs from the FOR_ITER to the JUMP_ABSOLUTE, encompassing just three instructions. Since the work done for all of these instructions is pretty trivial, it's easy to see how the overhead of the loop itself would be significantly higher for the manually managed counter with a while loop.
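
A quick way to see the same difference in wall-clock terms, assuming the two functions above are defined in the current module (absolute numbers vary by machine and CPython version, but the for/range version is consistently faster):

import timeit

# Time ten runs of each loop over a million iterations; essentially all of the
# time goes into the loop machinery itself, since the bodies do (almost) nothing.
print(timeit.timeit('whileloop(10**6)', globals=globals(), number=10))
print(timeit.timeit('forloop(10**6)', globals=globals(), number=10))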

ShadowRanger

[SO]: Python vs CPP: Why is the difference in speed so huge? (@ShadowRanger's answer) explains the why (the rationale behind the scenes) very well. Here are some attempts that I've made, in (incremental) steps.

  1. Setup:

    OS, tools and other info.

    [cfati@cfati-5510-0:/cygdrive/e/Work/Dev/StackOverflow/q057044727]> ~/sopr.sh
    *** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***
    
    [prompt]> uname -a
    CYGWIN_NT-10.0 cfati-5510-0 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
    [prompt]>
    [prompt]> python3 -c "import sys;print(\"Python {0:s} {1:d}bit on {2:s}\".format(\" \".join(item.strip() for item in sys.version.split(\"\n\")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))"
    Python 3.6.8 (default, Feb 14 2019, 22:09:48) [GCC 7.4.0] 64bit on cygwin
    [prompt]>
    [prompt]> g++ --version | grep g++
    g++ (GCC) 7.4.0
    [prompt]>
    [prompt]> ls
    dll00.cpp  dll01.cpp  main00.cpp  script00.py  script01.py  script02.py  script03.py  script04.py
    
  2. C++ (0):

    Split the code into 2 files (you'll see why later).

    dll00.cpp:

    #include <iostream>
    
    #if defined(_WIN32)
    #  define DLL_EXPORT_API __declspec(dllexport)
    #else
    #  define DLL_EXPORT_API
    #endif
    
    
    using std::cout;
    using std::endl;
    
    
    DLL_EXPORT_API int func00() {
        int non_primes = 1;
        for (int i = 2; i < 100000; i++) {
            for (int j = 2; j < i; j++) {
                if (i % j == 0) {
                    non_primes++;
                    break;
                }
            }
        }
        cout << non_primes << endl;
        return 0;
    }
    

    main00.cpp:

    #include "dll00.cpp"
    
    
    int main() {
        return func00();
    }
    

    Output:

    [prompt]> g++ -std=c++11 main00.cpp -o main000
    [prompt]>
    [prompt]> time ./main000
    90407
    
    real    0m1.384s
    user    0m1.359s
    sys     0m0.000s
    
  3. script00.py:

    Your original script (with small corrections).

    #!/usr/bin/env python3
    
    
    def main():
        non_primes = 1
        i = 2
        while i < 100000:
            j = 2
            while j < i:
                if i % j == 0:
                    non_primes += 1
                    break
                j += 1
            i += 1
        print(non_primes)
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    [prompt]> time python3 script00.py
    90407
    
    real    0m53.738s
    user    0m53.703s
    sys     0m0.031s
    
  4. script01.py:

    Replaced the (inefficient) while loops with for loops (using range).

    #!/usr/bin/env python3
    
    
    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, i):
                if i % j == 0:
                    non_primes += 1
                    break
        print(non_primes)
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    [prompt]> time python3 script01.py
    90407
    
    real    0m34.142s
    user    0m34.124s
    sys     0m0.000s
    
  5. script02.py:

    Use a Pythonic zero test (not i % j) instead of an explicit equality comparison.

    #!/usr/bin/env python3
    
    
    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, i):
                if not i % j:
                    non_primes += 1
                    break
        print(non_primes)
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    [prompt]> time python3 script02.py
    90407
    
    real    0m28.440s
    user    0m28.406s
    sys     0m0.031s
    
  6. script03.py:

    Specific to this case: the search for divisors is highly inefficient. It iterates up to the number itself, when in fact it only needs to go up to its square root, generating lots of useless operations that deepen the performance gap between the two languages.

    #!/usr/bin/env python3
    
    from math import sqrt
    
    
    def main():
        non_primes = 1
        for i in range(2, 100000):
            for j in range(2, int(sqrt(i) + 1)):
                if not i % j:
                    non_primes += 1
                    break
        print(non_primes)
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    [prompt]> time python3 script03.py
    90407
    
    real    0m0.291s
    user    0m0.265s
    sys     0m0.015s
    

    As seen, a whopping difference: almost 100 times faster than the previous version, and even better than the (original) C++ code.

  7. C++ (1):

    The previous step changed the algorithm itself, so the C++ variant must be changed as well; otherwise the comparison would be unfair.

    dll01.cpp:

    #include <iostream>
    #include <math.h>
    
    #if defined(_WIN32)
    #  define DLL_EXPORT_API __declspec(dllexport)
    #else
    #  define DLL_EXPORT_API
    #endif
    
    
    using std::cout;
    using std::endl;
    
    
    #if defined(__cplusplus)
    extern "C" {
    #endif
    
    DLL_EXPORT_API int func00() {
        int non_primes = 1;
        for (int i = 2; i < 100000; i++) {
            for (int j = 2; j < static_cast<int>(sqrt(i) + 1); j++) {
                if (i % j == 0) {
                    non_primes++;
                    break;
                }
            }
        }
        cout << non_primes << endl;
        return 0;
    }
    
    #if defined(__cplusplus)
    }
    #endif
    

    main00.cpp must (obviously) be modified accordingly (#include "dll01.cpp").

    Output:

    [prompt]> g++ -std=c++11 main00.cpp -o main001
    [prompt]>
    [prompt]> time ./main001
    90407
    
    real    0m0.279s
    user    0m0.250s
    sys     0m0.030s
    
  8. Call C++ code (C interfaced) from Python via [Python 3.Docs]: ctypes - A foreign function library for Python:

    Uses the C++ code from previous step.

    script04.py:

    #!/usr/bin/env python3
    
    import ctypes
    
    
    def main():
        dll = ctypes.CDLL("./dll01.so")
        func = dll.func00
        func.argtypes = []
        func.restype = ctypes.c_int
        func()
    
    
    if __name__ == "__main__":
        main()
    

    Output:

    [prompt]> g++ -std=c++11 -fPIC -shared dll01.cpp -o dll01.so
    [prompt]>
    [prompt]> time python3 script04.py
    90407
    
    real    0m0.327s
    user    0m0.281s
    sys     0m0.031s
    

Conclusions (drawn from the above examples):

  • I've run each step 3 times and placed the middle result here. However, a test with meaningful results should be run several thousand times and an average computed. Also, the fact that I'm using Cygwin might interfere with the results.

  • Writing Pythonic code improved performance almost 2 times (#4., #5.)

  • Writing an efficient algorithm reduced the difference between the two languages almost to 0 (#6. vs. #7.), and the (pure) Python code even seems to run faster than #8.
    However, don't let yourself be deceived by these facts. As shown, if the number of operations grows (and not necessarily due to inefficiency), C++ will work a lot faster.
    You can check this by applying step #8. to dll00.cpp; a minimal sketch of that check follows.
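    For example (a minimal sketch: note that dll00.cpp would first need the same extern "C" wrapper as dll01.cpp, otherwise g++ mangles func00 and ctypes can't find it by name; the dll00.so name is just illustrative):

    #!/usr/bin/env python3

    # Hypothetical check: after adding the extern "C" block to dll00.cpp, build with
    #     g++ -std=c++11 -fPIC -shared dll00.cpp -o dll00.so
    # and time this script the same way as script04.py.
    import ctypes


    def main():
        dll = ctypes.CDLL("./dll00.so")  # the original variant, without the sqrt bound
        func = dll.func00
        func.argtypes = []
        func.restype = ctypes.c_int
        func()


    if __name__ == "__main__":
        main()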

CristiFati
  • Just for posterity, could you include the version of Python you used (major.minor, 32 vs. 64 bit)? Every once in a (long) while they change the interpreter to improve unusually slow things (e.g. recent releases have dramatically reduced the overhead of method calls with solely positional arguments, so some advice about avoiding method calls when syntax solutions exist is less relevant than before), so it would be nice to be able to compare if later version, for example, make `range` cheaper, or simple arithmetic, or comparisons to `0`, or whatever. – ShadowRanger Jul 15 '19 at 22:36
  • @ShadowRanger: version was already there (in ***#1.***), added architecture. – CristiFati Jul 15 '19 at 22:58
  • Thanks. Missed the version, sorry about that! – ShadowRanger Jul 15 '19 at 23:15

You are calculating something like the count of non-prime numbers up to some n. Doing so with a sieve is much faster:

def count_primes(n):
    count = 0
    w = [False]*n                     # w[m] becomes True once m is known to be composite
    for m in range(2, n):
        if not w[m]:                  # m is prime
            # mark every multiple of m from m*m upward in one slice assignment
            w[m*m::m] = [True] * ((n+m-m*m-1)//m)
            count += 1
    return count

print(99999 - count_primes(100000))

This runs in milliseconds, even with Python.
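
As a quick sanity check (the other programs in this thread print 90407 for this input, and 9592 is the number of primes below 100000):

# Using count_primes as defined above.
assert count_primes(100000) == 9592
assert 99999 - count_primes(100000) == 90407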

Daniel
  • While true, not all problems can be solved with better algorithms; increase your sieve boundary, and optimize the C code equivalently (or even better, since C code can do all sorts of bit twiddling hacks to reduce memory usage at no real cost in performance, which Python can't match), and you'll find CPython still falls behind as you provide larger and larger `n`s. – ShadowRanger Jul 15 '19 at 18:27
  • While this is true, most of this algorithm is executed in `C` code, which makes it run at native speed. – Daniel Jul 15 '19 at 19:02
  • Yes, it's a *lot* faster, but even so, equivalently optimized C code wins. I've written similar code (it output the primes themselves, not just the count, and used some tricks to reduce the bytecode per loop further), and hyperoptimized Python lost to the hyperoptimized C by a factor of ~8x for an input of `n = 10**8`. Don't get me wrong, the Python code was *much* faster to write, and even with some absurdly ugly optimizations, was still easier to read/verify. That's the whole point of writing Python at all; it's faster and easier to write, at the expense of being slower to execute. – ShadowRanger Jul 15 '19 at 19:10