Why is my Python NumPy code faster than C++?

Question

Why is this Python NumPy code,

import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)

t1 = time.time()
for k in xrange(1,k_max+1):
    cos_k = np.cos(k*data[0,:])
    sin_k = np.sin(k*data[0,:])
    coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
    coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()

print('Time:')
print(t2-t1)

faster than the following C++ code?

#include <cstdio>
#include <iostream>
#include <cmath>
#include <time.h>

using namespace std;

// consts
const unsigned int k_max = 40000;
const unsigned int N = 10000;

int main()
{
    time_t start, stop;
    double diff;
    // table with data
    double data1[ N ];
    double data2[ N ];
    // table of results
    double coefs1[ k_max ];
    double coefs2[ k_max ];
    // main loop
    time( & start );
    for( unsigned int j = 1; j<N; j++ )
    {
        for( unsigned int i = 0; i<k_max; i++ )
        {
            coefs1[ i ] += data2[ j-1 ]*(cos((i+1)*data1[ j-1 ]) - cos((i+1)*data1[ j ]));
            coefs2[ i ] += data2[ j-1 ]*(sin((i+1)*data1[ j-1 ]) - sin((i+1)*data1[ j ]));
        }
    }
    // end of main loop
    time( & stop );
    // speed result
    diff = difftime( stop, start );
    cout << "Time: " << diff << " seconds";
    return 0;
}

The first one shows: "Time: 8 seconds" while the second: "Time: 11 seconds"

I know that NumPy is written in C, but I would still think that C++ example would be faster. Am I missing something? Is there a way to improve the C++ code (or the Python one)?

Version 2 of the code

I have changed the C++ code (dynamical tables to static tables) as suggested in one of the comments. The C++ code is faster now, but still much slower than the Python version.

Version 3 of the code

I have changed from debug to release mode and increased 'k' from 4000 to 40000. Now NumPy is just slightly faster (8 seconds to 11 seconds).

Probably because you're using an array of pointers to dynamically allocated arrays, instead of a single block of memory. — juanchopanza, Dec 28 '16 at 16:17
I agree with @juancho . The cache lining behavior will be horrible. — πάντα ῥεῖ, Dec 28 '16 at 16:20
I get `Time: 1 sekund` on my machine. Did you run the C++ code in debug mode by any chance? — Bo Persson, Dec 28 '16 at 16:25
I'm assuming numpy also takes advantage of SIMD. C++ compilers can auto-vectorize code but it isn't guaranteed. There could be other factors at play. Also, you probably aren't generating release mode code as stated. — RyanP, Dec 28 '16 at 16:33
@juanchopanza I have changed the dynamical tables in C++ and it made the code faster but still significantly slower than Python. — Mateusz, Dec 28 '16 at 16:36
@Mateusz in your numpy example, all of the data are 0's I believe (I assume that is what .zeros does). What happens if you set all of your data in the C++ app to 0 first? — RyanP, Dec 28 '16 at 16:38
Possible duplicate of [why are numpy arrays so fast](http://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast) — Lucas, Dec 28 '16 at 16:48
@BoPersson I also run this code (c++) and result was also 1 s, where python code had the same time as stated - so debug mode is probably the case. — Gwidryj, Dec 28 '16 at 17:07
@RyanP using 0's doesn't change the result. Release mode made C++ code much faster but still slower than Python version. — Mateusz, Dec 28 '16 at 17:08
When I try to run your python code, on ideone, I get [time limit exceeded](http://ideone.com/x5CI6h), while running your C++ code, returns [time: 0 seconds](http://ideone.com/fcBJLX). So.. Can't reproduce? — Algirdas Preidžius, Dec 28 '16 at 17:22
@AlgirdasPreidžius because ideone has timeout (probably 5s as stated in your link), when you decrees e.g. k_max code will be executed properly. I can reproduce both python and c++. — Gwidryj, Dec 28 '16 at 18:16
@Gwidryj I don't have python installed on my machine, so, to test it, I chose ideone. I can understand why python version got time limit exceeded (given the timeframes mentioned in the question), but since C++ code finished in less than a second, it was clearly faster than python version. And that was the whole point of my comment. — Algirdas Preidžius, Dec 28 '16 at 21:29

score 52 · Accepted Answer · edited Aug 16 '21 at 16:34

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).

The answer is: your C++ code is not slower than your Python code when properly compiled.

I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC.

I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main, to compile my C code I used gcc -O3 main.c -lm -o main. In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles (dtype=np.float64), because there are doubles in C++ example. My pure C version of your code (it's similar):

#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
    int z;
    for( z = 0; z < N; z++ )
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for( i = 0; i < k_max; i++ )
    {
        for( j = 0; j < N-1; j++ )
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return coefs1[0];
}

For k_max = 100000, N = 10000 results where following:

Python 70.284362 s
C++ 69.133199 s
C 61.638186 s

Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.

For k_max = 1000000, N = 1000 we have:

Python 115.42766 s
C++ 70.781380 s

For k_max = 1000000, N = 100:

Python 52.86826 s
C++ 7.050597 s

So the difference increases with fraction k_max/N, but python is not faster even for N much bigger than k_max, e. g. k_max = 100, N = 100000:

Python 0.651587 s
C++ 0.568518 s

Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.

Python script:

import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()

# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s

C script:

#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, time;
    int i, j;
    for( i = 0; i < x_len; i++ )
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        res = 0.0;
        for( i = 0; i < x_len; i++ )
        {
            res += x[i];
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", time);

    return y1[0], y2[0], res;
}

// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s

Python results:

cos: 22.7199969292 s
multi: 0.841291189194 s
sum: 1.15971088409 s

C results:

cos: 20.910590 s
multi: 0.633281 s
sum: 1.153001 s

As you can see NumPy is incredibly fast, but always a bit slower than pure C.

Sorry, but I have to downvote--the first code you have in this answer is completely broken (has undefined behavior, and at least in my testing dies with a segfault) and your final C code doesn't even seem to *attempt* to do the same things as originally intended. — Jerry Coffin, Dec 29 '16 at 19:31
I can successfully run it (despite returning double in place of int), but anyway, I don't understand your comment. In my opinion it does _exactly_ the same, main loops are identical. Could you explain why my "_final C code doesn't even seem to attempt to do the same things_"? — Gwidryj, Dec 29 '16 at 19:46
@JerryCoffin I would call it rather a small bug, not "_doesn't even seem to attempt to do the same things_". You could just say _put N-1 in j-loop_ instead of repeating "_undefined behavior_". Anyway, timing is almost the same - I just have checked corrected version. — Gwidryj, Dec 29 '16 at 20:05
Thousands of virus writers would like to confirm your idea that a buffer overrun is a small bug, and encourage you to continue doing it as often as possible. As far as not even trying to do the same things, your second code has (for example) `y2[i] = x[i] * 7.9463;`, which doesn't seem to correspond to anything in the question. — Jerry Coffin, Dec 29 '16 at 20:12
@JerryCoffin I think I explained it in a paragraph above, but let others judge this. — Gwidryj, Dec 29 '16 at 20:17
Upvote. I appreciate you doing the cos, multiplication and sum function comparisons! For C the code is considerably more than the Python version. So if Python's tabs dont annoy someone, that is the way to go for a ~2 sec time increase... — analytical_prat, Nov 13 '21 at 22:38
@analytical_prat The general rule is to use the easiest, most expressive, and highest-level language that still runs fast enough. It'll be easier for you as it produces smaller code with fewer chances to have bugs with higher-level operations that further removes possibilities for bugs (e.g. doing a scalar times a matrix versus a for loop to achieve the same thing). Going straight to C++ is a sign you might be doing premature optimization (or perhaps you know that language the best). I suspect C++ template expressions are much faster for nontrivial cases unlike a scalar times a matrix. — user904963, Dec 26 '21 at 06:19
*The benchmark is flawed*: `coefs2`, `y1` and `y2` are written but not read/used. Thus, an optimizing compiler can just remove the loop as they do not have any *side effect* (this is called the "as if rule" in C/C++). In fact, some compilers like Clang actually do that! For example, Clang 13 with `-O3` do not compute `y2`. ICC do even more optimizations: it does not compute both `y1` and `y2`. The OP code already have this issue though with both `coefs1`, `coefs2`... (hopefully mainstream compilers are not so clever to optimize this part yet). — Jérôme Richard, Dec 29 '21 at 13:29

Marine Galantin · Answer 2 · 2022-02-03T16:25:29.723

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...

Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.

I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.

C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).

Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy), where the very good answer from @Jfs, and we observe: "There is no difference between C++ and numpy on my machine."

Some more reference:

Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

one thing to note is that even seemingly simple stuff like array.max is actually extremely optimized (assembly) code - written directly by INTEL/AMD/APPLE So as a general rule, as long as the arrays are large naive c code will be slower. — felix-ht, May 03 '22 at 09:53

Jerry Coffin · Answer 3 · 2021-08-16T16:31:09.713

7

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).

I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:

#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
    T &operator()(size_t y, size_t x) {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }

    T operator()(size_t y, size_t x) const {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }
};

int main() {
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;

    auto start = high_resolution_clock::now();

    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }

    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}

This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.

The next obvious step would be to use multiple cores. To do that in C++, we can add this line:

#pragma omp parallel for

...before the outer for loop:

#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}

With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

edited Aug 16 '21 at 16:31

answered Dec 28 '16 at 18:10

Jerry Coffin

476,176
80
629
1,111

2

Well... this is interesting. Your version uses much more c++ structures, but main calculations are the same. I would not expect c++ code (with unnecessary usage of its structures) to be faster than pure c (and here it is faster in the reference to the python version). So I ran your version and original ones and averaged over 10 runs. Results: original c++: 10.8 s, original python 8.5 s, your c++ version: **32.8 s**. So, as expected, I can't reproduce improvement of yours. – Gwidryj Dec 28 '16 at 19:36
@Gwidryj: Clearly you're doing something *seriously* wrong (probably compiling without optimization, but you haven't told us enough for anybody to diagnose your problem with any real certainty). As to using a struct slowing anything down--sorry, but that shouldn't be expected at all. – Jerry Coffin Dec 28 '16 at 20:28
I am using gcc version 4.9.2 (Debian 4.9.2-10). Instruction to compile both c++ scripts: `g++ -std=c++11 main.cpp -o main`. I also checked that without initializing `data` with some finite values your code is ~3 times slower, but with initializing it with non-zeros only ~1.3 times slower. – Gwidryj Dec 29 '16 at 09:27
Try adding -O2 flag for optimization, – Fernando Dec 29 '16 at 13:16
@Fernando with optimization (whole instruction: `g++ -std=c++11 -O2 main.cpp -o main`) it's ~1.8 times slower (this c++ is slower than original one). Flag -O3 doesn't change anything. – Gwidryj Dec 29 '16 at 13:23
@Gwidryj: Using gcc 4.9 may well be a large part of your problem. It's years old. The current release is 6.3. While it's not always necessary to use the *absolute* latest release, I have a hard time imagining a reason to use anything older than 5.x, at least on a mainstream system. – Jerry Coffin Dec 29 '16 at 19:13
@Gwidryj: Oh, one more point: unless your compiler/standard library is completely broken, the values in the vectors *are* initialized (to `0.0`, to be precise). – Jerry Coffin Dec 29 '16 at 19:33
1

@JerryCoffin Yes, by _finite values_ I meant not zeros. – Gwidryj Dec 29 '16 at 19:51
1

@Gwidryj: Zero is literally the furthest possible point from infinity. – Jerry Coffin Dec 29 '16 at 20:07
According to [doc here](https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html), the option to enable openmp should be `-fopenmp`, not `-openmp`. – jdhao Aug 16 '21 at 09:50
@jdhao: The exact flag depends on the compiler you use, of course. – Jerry Coffin Aug 16 '21 at 16:28
@JerryCoffin I'm not sure it makes sense to talk about distance from infinity, because matters of infinity often have unintuitive results such as the set {0, ±1, ±2, ...} being the same size as {0, 1, 2, ...}. However, if you want to intuit about that concept, -1 is further from infinity than 0. – user904963 Dec 26 '21 at 21:59
@user904963: Sort of true, but if we're dealing with signed numbers, then we probably also want signed infinities, in which case -1 is closer to negative infinity. Any non-zero number will be closer to *an* infinity than 0 (to the extent that talking about distance to infinity makes sense, of course). – Jerry Coffin Dec 27 '21 at 08:10

score 3 · Answer 4 · edited Sep 04 '20 at 18:00

3

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs, hence should switch your for-loops. If this is the case, you should have the following:

#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

double cos_k, sin_k;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    double data[2][N];
    double coefs[k_max][2];

    time(&start);

    for(int i=0; i<k_max; ++i)
    {
        for(int j=0; j<N; ++j)
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop

    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;

    return 0;
}

Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3, while Python code runs at 7.816 seconds.

edited Sep 04 '20 at 18:00

Peter Mortensen

30,738
21
105
131

answered Dec 29 '16 at 15:16

quark

39
2

I have tried the same thinking that [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference) is the case here. I tried for many values of `k_max` and `N`, with zeros and non-zero values in `data`, but results were always practically the same, regardless the order of loops. – Gwidryj Dec 29 '16 at 15:57
"_I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs_" - this is not true, there is no way switching for-loops can change the result. In both cases you run through the same sets of `(i, j)` pairs, just in different order. – Gwidryj Dec 29 '16 at 16:06
@Gwidryj, Sure, it results in the same computation time. It was my observation in order to fully respect the formulae found on paper. What you didnt do it was the **optimization flag**: g++ main.cpp -o main **-O3** . – quark Dec 30 '16 at 11:28
@Gwidryj, or use -O2. Check what works better for you. – quark Dec 30 '16 at 11:32

score 0 · Answer 5 · answered Dec 26 '21 at 07:06

0

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is often slower than C++ since C++ can do more optimizations. They'll only be around each other with Python running somewhere between the same time as C++ to about twice C++ when doing the majority of your computation in large computations that Python pushes off to compiled binaries to calculate. Most anything beyond large matrix multiplication, addition, scalar on matrix multiplication, etc. will perform much worse in Python. For example, look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. That makes sense too if you do anything other than simple math operations - anything with linked lists, binary search trees, procedural code, etc. The interpreted nature of Python combined with it storing metadata for each object (even int, double, float, etc.) significantly bogs things down in a way that no Python programmer can fix.

answered Dec 26 '21 at 07:06

user904963

1,520
2
15
33

It is not true that C++ does more optimisation than pure C. This does not make sense and shows some doubts about the languages: compilers make improvements on your code and the compilers for c and cpp are usually the same (clang, gcc...). – Marine Galantin Feb 03 '22 at 16:08
1

@MarineGalantin C++ has templates. Take a look at the speed of `std::sort` in C++ compared to `qsort` in C. The language allows for more optimizations. Here, the compiler can inline the comparison function since it knows the types in use. The C solution cannot. Things aren't as clear as I thought though. If you look at The Benchmark Game website, C++ beats C severely in some tests but C beats C++ severely in others. Some are draws. The ones where C wins hands down seem to be parallelized whereas the C++ solution wasn't. C++ wins when they are both parallelized. – user904963 Feb 05 '22 at 15:46
That benchmark website shows the complexity of comparing languages if nothing else. Some solutions are parallelized well, others parallelized poorly, others with manual use of SIMD instructions, and all sorts of coding tricks that aren't exactly standard programs written in one language compared to standard programs written in another, relying only on compilers for optimizations. Sometimes, the algorithms used are different too. – user904963 Feb 05 '22 at 15:51

Why is my Python NumPy code faster than C++?

Version 2 of the code

Version 3 of the code

5 Answers5

Linked