-4

I have been writing a simple test to compare the speed improvement of C++ over Python. My results were unexpected, with C++ being almost three times as slow as the program written in Python. I guess there is something in the loop inside the Python function using iterators or something.

C++ code

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;
using namespace chrono;

void mult(int* a, int b)
{
    for (size_t i = 0; i < 100000000; i++)
    {
        a[i] *= b;
    }
}

int main()
{
    srand(time(0));
    int* a = new int[100000000];
    int l = 100000000;
    for (int i = 0; i < l; ++i)
    {
        a[i] = rand();
    }
    auto start = high_resolution_clock::now();
    mult(a, 5);
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stop - start);
    cout << duration.count() << endl;
    delete[] a;
}

Python code

import time
import numpy as np

def mult(x, a):
    for i in [x]:
        i *= a

x = np.random.random(100000000)
start = time.time() 
mult(x, 5)
elapsed = time.time()
elapsed = elapsed - start
print ("Time spent in (mult) is: ", elapsed)

Results

C++: 200 milliseconds

Python: 65 milliseconds

Matzul
  • How is 50ms three times faster than 50ms? – Mat Jun 29 '22 at 17:08
  • Welcome to Stack Overflow. Where the code says `for i in [x]: i *= a`, what exactly do you expect that to mean? *How* exactly do you expect that it is working? What do you expect `[x]`, specifically to do? What do you think `x` is before that point? – Karl Knechtel Jun 29 '22 at 17:08
  • Seems to me that when both sets of code are compiled in release mode, the behavior is the same. Which makes sense; `numpy` is a C extension, and the vast majority of the work is done in `numpy`, so you're largely comparing C to C++ (in situations where there is no meaningful difference; the timed code is implemented the same way in C or C++). – ShadowRanger Jun 29 '22 at 17:11
  • Umm your C++ code is writing back to memory with `a[i] *= b;`, but your python code does a single multiplication and throws out the value. – smac89 Jun 29 '22 at 17:11
  • @smac89: True, but Python's optimizer is too dumb to avoid computing the results. It ends up doing the same work. – ShadowRanger Jun 29 '22 at 17:12
  • Set your optimization level for C/C++ a little higher: since `a` is never used after it's computed, the compiler should be able to remove the computation, making the code run in, oh, 1 microsecond. – Victor Eijkhout Jun 29 '22 at 17:14
  • I see, so the problem is that in C++ the array is a reference as opposed to Python where it is a value. – Matzul Jun 29 '22 at 17:14
  • Are you sure these are your timing results? My results are almost 40 times slower. My machine is old, but I don't think it's *that* old. Did you perhaps typo an extra `0` in the iteration counts? – Karl Knechtel Jun 29 '22 at 17:31
  • @VictorEijkhout Good point. I added a bit to my answer to account for that. – Karl Knechtel Jun 29 '22 at 17:34
  • @VictorEijkhout As far as I see there is no way for `rand()` to alter `a` and no way for `high_resolution_clock::now();` to alter `a`. So yes, the compiler should see that `a` is never used. It should remove the multiplication, should remove saving the result of `rand()` and should eliminate the `new` and `delete` calls. So why does neither gcc nor clang do so? – Goswin von Brederlow Jun 29 '22 at 17:40
  • @VictorEijkhout I looked a bit more into this: If you remove the timer then gcc is smart enough to eliminate the `mult` but clang is not. If you comment out the `mult` as well then there is no change for gcc but clang suddenly eliminates the `a[i]` in `a[i] = rand();` and then the `new` and `delete` as well. clang only calls `rand()` 10000000 times. – Goswin von Brederlow Jun 29 '22 at 18:15

2 Answers

6

There are many reasons why this performance test does not give useful results.

  1. Don't compare, or pay attention to, unoptimized (debug) timing. The entire point of using a language like C or C++ is to enable (static) compiler optimizations; with optimizations enabled, the results are essentially the same. On the other hand, it is important to make sure that aggressive compiler optimizations don't optimize out your entire test (due to the result of the computation going unused, or due to undefined behaviour anywhere in your program, or due to the compiler assuming that part of the code can't actually be reached because there would be undefined behaviour if it were reached).

  2. for i in [x]: is a pointless loop: it creates a Python list of one element, and iterates once. That one iteration does i *= a, i.e., it multiplies i, which is the Numpy array. The code only works accidentally; it happens that Numpy arrays specially define * to do a loop and multiply each element. Which brings us to...
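To see this concretely, here is a small sketch (array contents chosen arbitrarily) showing that the Python-level loop body runs exactly once, and that the single `i *= a` rescales the whole array via Numpy's element-wise `*=`:

```python
import numpy as np

x = np.array([1, 2, 3])
iterations = 0
for i in [x]:        # [x] is a one-element list whose only element is the array
    iterations += 1
    i *= 5           # Numpy defines *= to multiply every element in place
print(iterations)    # 1 -- the Python-level loop ran only once
print(x)             # [ 5 10 15] -- the whole array was scaled in C code
```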

  3. The entire point of using Numpy is that it optimizes number-crunching by using code written in C behind the scenes to implement algorithms and data structures. i simply contains a pointer to a memory allocation that looks essentially the same as the one the C program uses, and i *= a does a few O(1) checks and then sets up and executes a loop that looks essentially the same as the one in the C code.
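In other words, the idiomatic Numpy version of this benchmark is a single statement with no Python-level loop at all (a sketch with a small array for illustration):

```python
import numpy as np

x = np.arange(8)
x *= 5          # one vectorized, in-place multiply in Numpy's compiled loop
print(x)        # [ 0  5 10 15 20 25 30 35]
```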

  4. This is not reliable timing methodology, in general. That is a whole other kettle of fish. The Python standard library includes a timeit module intended to make timing easier and help avoid some of the more basic traps. But doing this properly in general is a research topic beyond the scope of a Stack Overflow question.
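For example, a minimal `timeit` sketch (list size and repeat counts are arbitrary choices) that times a pure-Python multiply loop while handling repetition and clock selection for you:

```python
import timeit

setup = "x = list(range(100_000))"
stmt = "y = [i * 5 for i in x]"
# run the statement 10 times per trial, 3 trials, and keep the best trial
best = min(timeit.repeat(stmt, setup=setup, number=10, repeat=3))
print(f"best of 3 trials: {best:.4f} s for 10 runs")
```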


"But I want to see the slow performance of native Python, rather than Numpy's optimized stuff - "

If you just want to see the slow performance of Python iteration, then you need for the loop to actually iterate over the elements of the array (and write them back):

def mult(x, a):
    for i in range(len(x)):
        x[i] *= a

Except that experienced Pythonistas won't write the code that way, because `range(len(...))` is ugly. The Pythonic approach is to create a new list:

def mult(x, a):
    return [i*a for i in x]

That will also show you the inefficiency of native Python data structures (we need to create a new list, which contains pointers to int objects).

On my machine, it is actually even slower to process the Numpy array this way than a native Python list. This is presumably because of the extra work that has to be done to interface the Numpy code with native Python, and "box" the raw integer data into int objects.
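A rough way to observe that on your own machine (list size and repeat count are arbitrary choices for illustration; absolute numbers will vary):

```python
import timeit

n = 100_000
# The same Python-level loop over a native list vs. a Numpy array of the
# same numbers; the array version must "box" each element into an int object.
t_list = timeit.timeit("[i * 5 for i in x]",
                       setup=f"x = list(range({n}))", number=10)
t_array = timeit.timeit("[i * 5 for i in x]",
                        setup=f"import numpy as np; x = np.arange({n})",
                        number=10)
print(f"list:  {t_list:.4f} s")
print(f"array: {t_array:.4f} s")
```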

Karl Knechtel
-3

Your Python code does not do the same thing. It does not iterate over all the elements of x; rather, it iterates over a list (of size 1) containing one element: x itself. A list of one list.

pavelkolodin
  • This is not correct. The element contained in the `x` list is a Numpy array, which handles `*=` differently from native Python lists. – Karl Knechtel Jun 29 '22 at 17:29
  • This is incorrect on multiple accounts. While poorly expressed the OP's code _does_ iterate through all the elements of `x`, which you can trivially check by running it. Maybe you're not familiar with numpy's broadcasting rules, or misunderstanding what the strange loop structure is doing? It's also not true that the result of the multiplication isn't being stored. But even if the result wasn't stored, it's not possible for Python to "optimize away" calls to numpy, or for numpy to elide work based on how the result is used in Python. Talking about CPU cycles is completely without meaning here – Brian61354270 Jun 29 '22 at 17:37
  • The `for` loop is ineffective though, right? It loops over the one element `x` in the list and then does the numpy `*=` on the numpy array `x`, right? – Goswin von Brederlow Jun 29 '22 at 17:43