
Recently, I had a debate with a colleague about Python vs. C++ performance. Both of us use these languages mostly for linear algebra, so I wrote two scripts: one in Python 3 using NumPy, and the other in C++ using Eigen.

The Python 3 / NumPy version, matmul_numpy.py:

import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start) 

If I run this script with

python3 matmul_numpy.py

This will return:

0.07 seconds

The C++ / Eigen version, matmul_eigen.cpp:


#include <iostream>
#include <Eigen/Dense>
#include <time.h>

int main() {
    clock_t start, end;
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
    start = clock();
    Eigen::MatrixXd c = a * b;
    end = clock();
    std::cout << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
    return 0;
}

The way I compile it is,

g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen

This returns (with both -std=c++11 and -std=c++17):

0.35 seconds

This is very odd to me. 1- Why is NumPy so much faster than C++ here? Am I missing any optimization flags?

I thought maybe it was the Python interpreter that was somehow executing the program faster, so I compiled the Python script with Cython, following this Stack Overflow thread.

The compiled Python script was still faster (0.11 seconds). This adds two more questions for me:

2- Why did it get slower? Does the interpreter do any more optimization?

3- Why is the binary compiled from the Python script (37 kB) smaller than the C++ one (57 kB)?

I would appreciate any help,

thanks

  • For Python, you should use `timeit` to properly time code execution (and get an average over multiple runs) – OneCricketeer Jan 07 '23 at 00:21
  • Do these programs do the same thing? Do you get the same result from both programs? – Ted Lyngmo Jan 07 '23 at 00:36
  • @OneCricketeer, thanks, there was no major difference. – William_____that's_all Jan 07 '23 at 00:41
  • Honestly, I don't think we can really answer this without looking at the decompiled output of both or knowing how each operates internally. – OneCricketeer Jan 07 '23 at 00:43
  • @TedLyngmo, I am not getting the same results, because I am using random numbers. But I don't think that is so important, since they both use float64 datatypes. – William_____that's_all Jan 07 '23 at 00:48
  • _"because I am using random numbers"_ - Ok, you need to use the same random number generator for both. You can't compare a program using a slow PRNG with a program using a fast one. `np.random.rand` may use a more modern PRNG than what's included in Eigen (or standard C++ for that matter). - and, make sure you get the same result. Otherwise comparing is pointless. Seed them both the same way. – Ted Lyngmo Jan 07 '23 at 00:50
  • @TedLyngmo, but that shouldn't make any difference, I am measuring the time only before and after the multiplications! The time spent on random generator is not included. – William_____that's_all Jan 07 '23 at 00:53
  • Oh, right. Ok, then, make sure that both programs work on exactly the same data. Same input, same output. – Ted Lyngmo Jan 07 '23 at 00:54
  • Numpy may use a multithreaded implementation of the linear algebra operations, depending on the backend used, see https://numpy.org/devdocs/reference/global_state.html, while `Eigen` doesn't with the options you used. – user17732522 Jan 07 '23 at 00:58
  • Size of binaries doesn't tell you much. Most of the functionality is in shared libraries that are linked in at runtime anyway. And on the other hand there is some fixed stuff that is always included in a C++ executable, even if it isn't really used. That amount is smaller for a C executable. – user17732522 Jan 07 '23 at 01:01
  • @TedLyngmo: They're not timing the whole program, only the matmul part, so hopefully the PRNG speed doesn't matter. And FP add/mul/FMA speed isn't data-dependent on hardware from this century, except for being very slow with subnormal numbers which is unlikely for a matmul of random numbers. (Or is it? I guess it's worth checking `perf stat -e task-clock,cycles,instructions,fp_assist.any`; any counts for FP assists are traps to microcode to handle subnormals aka denormals.) Perf would also tell you if it's multi-threaded. – Peter Cordes Jan 07 '23 at 02:16
  • @PeterCordes Re: _"PRNG speed doesn't matter"_ - Yes, I got that (eventually). :-) Even if the data processed doesn't matter when it comes to the perf result, then why not use the same data so no one will question it, is my philosophy. – Ted Lyngmo Jan 07 '23 at 02:24
  • @TedLyngmo: Yeah, that's a good philosophy, although better would be the same implementation of the same PRNG, to give equal "warm up" time. Although in this case, it's easily long enough, and you spend a lot of that time page-faulting in the input arrays vs. just letting them be zeros. (Oh, but the output array might not have been touched yet, so page faults are something to profile for, adding the `page-faults` event to the list in `perf stat` if you're using it. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) – Peter Cordes Jan 07 '23 at 02:31
  • @PeterCordes Indeed. I'm running my server with standard settings. It just puffs along at a low speed and speeds up whenever anyone does anything fun on it. One friend ([`Mostly mangling`](http://mostlymangling.blogspot.com/)) ran PRNG-testing (in plain user-mode) so that all cores got overheated and throttled(!). It turned out to be a BIOS bug that let that happen. In normal cases, I speed up all cores and run the 2 competing algorithms after each other a few times before I feel I found a winner - and I always let them deal with the same data. – Ted Lyngmo Jan 07 '23 at 02:42
  • @user17732522, "while Eigen doesn't with the options you used", thanks but I think I did with the O3 option (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) and enabled vectorization -march=native (http://eigen.tuxfamily.org/index.php?title=FAQ). I even added -fopenmp (https://eigen.tuxfamily.org/dox/TopicMultiThreading.html) and it made it worse to 0.40 seconds. – William_____that's_all Jan 07 '23 at 04:14
  • @William_____that's_all `-O3`/`-march=native` and vectorization have nothing to do with multithreading. `-fopenmp` should enable it. Try running the program (both the Eigen variant and the Numpy one) with the `OMP_NUM_THREADS` environment variable set to different values (I would try at least `1` and the number of cores on your cpu) and see what effect that has. – user17732522 Jan 07 '23 at 04:21
  • @user17732522, thank you very much, I didn't know the difference between vectorization and multi-threading; if possible, redirect me to a good link. With the things you said, adding the -fopenmp flag and running with OMP_NUM_THREADS=12 ./matmul_eigen, I now get 36 milliseconds. I have 16 threads according to htop. Do you think I can still improve? – William_____that's_all Jan 07 '23 at 05:58

2 Answers

12

The biggest issue is that you are comparing two completely different things:

  • In NumPy, a*b performs an element-wise multiplication, since a and b are 2D arrays and are not treated as matrices. a@b performs a matrix multiplication.
  • In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.

The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and most BLAS libraries parallelize them. Element-wise multiplications are memory-bound (especially here, due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, and the gap is not huge either because the latter is memory-bound.

On my i5-9600KF processor (with 6 cores), NumPy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
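
For reference, here is a minimal sketch (not part of the original answer) that times both operations with time.perf_counter, so the difference can be reproduced directly:

# Sketch: time element-wise multiplication vs. matrix multiplication in NumPy.
import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

t0 = time.perf_counter()
c_elem = a * b        # element-wise product, O(n^2) work
t1 = time.perf_counter()
c_mat = a @ b         # matrix product (same as np.matmul / np.dot), O(n^3) work
t2 = time.perf_counter()

print("a*b (element-wise):", t1 - t0, "s")
print("a@b (matmul):      ", t2 - t1, "s")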

Note that NumPy element-wise multiplications like this are not parallel (at least not in the standard default implementation of NumPy). NumPy's matrix multiplication uses a BLAS library, generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same one as NumPy's.
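
To check which BLAS backend a given NumPy build is actually using, something like the sketch below can help. np.show_config() is standard NumPy; threadpoolctl is an optional third-party package, so that part is only an assumption about what is installed:

# Sketch: inspect the BLAS/LAPACK libraries behind NumPy and their thread pools.
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built against

try:
    # threadpoolctl (third-party, may not be installed) reports the thread
    # pools of the BLAS libraries loaded at runtime.
    from threadpoolctl import threadpool_info
    for pool in threadpool_info():
        print(pool.get("internal_api"), "threads:", pool.get("num_threads"))
except ImportError:
    pass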

Also note that clock is not a good way to measure parallel code, as it measures CPU time rather than wall-clock time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
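
The same pitfall can be shown from the Python side: time.process_time is roughly the analogue of C's clock (CPU time summed over all threads), while time.perf_counter measures wall-clock time. A small sketch, assuming NumPy is linked against a multi-threaded BLAS:

# Sketch: CPU time vs. wall-clock time around a (potentially parallel) matmul.
import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

wall0, cpu0 = time.perf_counter(), time.process_time()
c = a @ b
wall1, cpu1 = time.perf_counter(), time.process_time()

# With a parallel BLAS the CPU time can be several times larger than the wall
# time, which is exactly why clock()-based timings overstate parallel C++ code.
print("wall:", wall1 - wall0, "s   cpu:", cpu1 - cpu0, "s")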


3- Why is the binary compiled from the Python script (37 kB) smaller than the C++ one (57 kB)?

Python is generally compiled to bytecode, which is not native assembly code. C++ is generally compiled to an executable program that contains assembled machine code, additional information needed to run the program, as well as meta-information. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter); in fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
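
As a rough illustration of how high-level the bytecode is, CPython's dis module can disassemble a tiny matmul function; this sketch is only illustrative and not part of the original answer:

# Sketch: disassemble a small function to see CPython's compact, stack-based bytecode.
import dis

def matmul(a, b):
    return a @ b

# Prints a handful of instructions (load the arguments, apply the matrix-multiply
# operator, return); no machine code, loop unrolling or inlining is involved.
dis.dis(matmul)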

– Jérôme Richard
  • Wow! Thanks a lot! Your response was very helpful. I applied the changes you and another gentleman in the comments suggested, and now my compiled Eigen C++ code beats the Python version by 10 milliseconds. Also, clock_t was not suitable, as you said; I used chrono::steady_clock. However, I still have one question: when I compile the Python code into a binary, it takes longer to execute. What does the Python interpreter do that makes it run faster? – William_____that's_all Jan 07 '23 at 06:13
  • Good to know :) . For the compilation, CPython generates a bytecode file the first time so it does not have to parse the script later. The bytecode is cached, so the first run can be a bit slower. This operation is generally pretty fast, automatic and transparent. Just to be sure I understand your question: how do you "compile the python code into binary"? – Jérôme Richard Jan 07 '23 at 06:19
  • I first generate a C file out of python script with `cython example_file.py --embed`. Then I compile it with g++, linked with python static libs and includes. `gcc -Os $(python3-config --includes) example_file.c -o output_bin_file $(python3-config --ldflags) -l$python3.6m` – William_____that's_all Jan 07 '23 at 06:32
  • Cython should not help much in this case. It will generate native code doing what CPython does, but it will not be significantly faster because 1. most of the time should be spent in the NumPy code, which Cython cannot optimize, and 2. most of the overhead of CPython (eg. dynamic typing, object allocation, reference counting, variable-sized integers, etc.) is still present unless annotations are added to remove it. Cython cannot make NumPy faster because the code of NumPy is already compiled. – Jérôme Richard Jan 07 '23 at 06:39
  • It might be slower depending on how the program is linked: if the compiled version uses a slower BLAS implementation, then the code can be significantly slower. Besides this, using Cython should not significantly impact the execution time of NumPy calls (at least not at the granularity of >1 ms). – Jérôme Richard Jan 07 '23 at 06:43
  • Uhum, I understand. Something even weirder happened: if I change the OMP_NUM_THREADS env variable, the binary produced by the gcc compilation changes its speed significantly. If I do OMP_NUM_THREADS=8 ./output_bin_file, the code runs in 0.047 seconds, which is almost half of the 0.078 seconds with the interpreter. I think what you were implying is correct; since NumPy is already compiled with an OMP implementation, the compiled script using it can improve its performance when OMP_NUM_THREADS is increased. Thanks a lot for your explanations. – William_____that's_all Jan 07 '23 at 06:53
  • NumPy uses OpenBLAS by default, which itself uses OpenMP internally. But NumPy itself does not currently use OpenMP. Changing `OMP_NUM_THREADS` sets the number of threads of OpenBLAS, and hence impacts the performance of the matrix multiplication. Note that BLAS implementations do not always use OpenMP, so this behaviour is dependent on the target platform. – Jérôme Richard Jan 07 '23 at 17:22
  • You are welcome. Feel free to validate the answer if it is clear for you ;) . – Jérôme Richard Jan 07 '23 at 17:22
2

Based on what I learned from @Jérôme Richard's answer and the comments from @user17732522, it seems I made two mistakes in the comparison:

1- I defined the multiplication incorrectly in the Python script; it should be np.matmul(a,b), np.dot(a,b) or a@b, not a*b, which is an element-wise multiplication.

2- I didn't measure the time in the C++ code correctly. clock_t doesn't work right for this measurement; std::chrono::steady_clock works better.

After applying these fixes, the C++ Eigen code is about 10 times faster than the Python one.

The updated code for matmul_eigen.cpp:

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

int main() {
    size_t n = 2000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
    auto t1 = std::chrono::steady_clock::now();
    Eigen::MatrixXd c = a * b;
    auto t2 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(t2 - t1).count() << std::endl;
    return 0;
}

To compile, both the vectorization and multi-threading flags should be enabled:

g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul

To run the code with multiple threads:

OMP_NUM_THREADS=4 ./eigen_matmul

where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:

lscpu | grep "CPU(s):"

This will return 0.104 seconds.

The updated python script matmul_numpy.py:

import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)

a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)

start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)

To run the code,

python3 matmul_numpy.py

This will return 1.0531 seconds.

As for the reason why it is like this, @Jérôme Richard's answer is a better reference.

  • Since NumPy should call the `dgemm` BLAS call internally, and the BLAS should be the same, the performance of Eigen and NumPy should be similar (with the same configuration). The NumPy code may be a bit slower due to NumPy overhead, but I do not expect more than a 50% overhead in this case given the input size and the provided timings. Thus, I advise you to check the NumPy configuration and more specifically the linked BLAS. See [this post](https://stackoverflow.com/questions/9000164) for more information (there are many other posts like this). – Jérôme Richard Jan 07 '23 at 23:56
  • Note that `OMP_NUM_THREADS` should be automatically set by the OpenMP runtime to the number of cores (or threads, depending on the target runtime), assuming it is not already set. `OMP_DISPLAY_ENV=TRUE` helps to get more information about the OpenMP runtime configuration. Note that using multiple OpenMP runtimes can sometimes cause performance issues. Also note that threads are sometimes badly bound to cores by default and manual thread-binding is required. You can play with `OMP_PLACES` to check/tune that (as well as `OMP_PROC_BIND`). – Jérôme Richard Jan 08 '23 at 00:00
  • @JérômeRichard By default Eigen uses its own BLAS implementation. If the version used by Python is not optimized for the local CPU, and only Eigen has multithreading activated, then being 10 times faster sounds at least plausible. – chtz Jan 08 '23 at 17:43
  • @chtz The documentation says that a dgemm call is performed. But, indeed, it looks like they do not use an external BLAS implementation unless `EIGEN_USE_BLAS` is specified, if my understanding is correct. Still, the default OpenBLAS implementation should be pretty efficient in this case (>70% of the optimal time on most platforms) and parallel by default. A 10x performance difference suggests that something is clearly going wrong. OpenBLAS has thread-mapping issues on some platforms (eg. IBM machines) that can cause this. A bad configuration is also possible. – Jérôme Richard Jan 08 '23 at 23:38