1

I come from a Matlab background, and when I bought a laptop a year ago, I carefully selected one with a lot of compute power: the machine has 4 cores and offers me 8 hardware threads at 2.4GHz. The machine proved itself to be very powerful, and using simple parfor loops I could utilize all the processor threads, getting a speedup of nearly 8 for many problems and experiments.

This nice Sunday I was experimenting with numpy. People often tell me that numpy's core operations are implemented efficiently using libblas, possibly even using multiple cores via libraries like OpenMP (with OpenMP you can create parfor-like loops using C-style pragmas).

This is the general approach for many numerical and machine learning algorithms: you express them in terms of expensive high-level operations like matrix multiplications, but write them in an expressive, high-level language like Matlab or Python for comfort. Moreover, C(++) allows us to bypass the GIL.

So the cool part is that linear algebra should run really fast in Python whenever you use numpy. You just have the overhead of some function calls, but if the calculation behind them is large, that overhead is negligible.

So, without even touching the topic that not everything can be expressed in linear algebra or other numpy operations, I gave it a spin:

t = time.time(); numpy.dot(range(100000000), range(100000000)); print(time.time() - t)
40.37656021118164

During these 40 seconds I saw ONE of the 8 threads on my machine working at 100%, and the others stayed near 0%. I didn't like this, but even with one thread working I'd expect this to run in approximately 0.something seconds. The dot product does 100M additions and 100M multiplications, so at 2.4GHz there would be 2400M / 100M = 24 clock ticks available per element for one +, one * and whatever overhead, if the whole thing finished in one second.
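To check where the time actually goes, the measurement can be split into the range-to-array conversion and the dot product itself. A minimal sketch (scaled down to 10M elements so it finishes quickly; the split between the two timings is the interesting part, not the absolute numbers):

```python
import time

import numpy

N = 10**7  # scaled down from the 100M in the question so the example runs quickly

t = time.time()
x = numpy.array(range(N))  # numpy walks the Python range element by element
t_convert = time.time() - t

t = time.time()
d = numpy.dot(x, x)        # the actual dot product, on a real numpy array
t_dot = time.time() - t

print('conversion: %.3fs, dot product: %.3fs' % (t_convert, t_dot))
```

On my machine virtually all of the time lands in the conversion step, not in the dot product.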

Nevertheless, the algorithm actually needs 40 * 24 =approx= 1000 ticks (!!!!!) per element for the +, the * and the overhead. Let's do this in C++:

#include<iostream>

int main() {
  unsigned long long result = 0;
  for(unsigned long long i=0; i < 100000000; i++)
    result += i * i;
  std::cout << result << '\n';
}

BLITZ:

herbert@machine:~$ g++ -std=c++11 dot100M.cc 
herbert@machine:~$ time ./a.out
662921401752298880

real    0m0.254s
user    0m0.254s
sys 0m0.000s

0.254 seconds, roughly 150 times faster than numpy.dot.

I thought maybe the python3 range generator was the slow part, so I handicapped my C++11 implementation by storing all 100M numbers in a std::vector first (using iterative push_backs) and then iterating over it. This was a lot slower: it took a little below 4 seconds, which is still 10 times faster.
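The same distinction exists on the numpy side: filling an array through a Python-level iterable is the slow path, while numpy.arange fills it natively. A quick sketch of the comparison (again scaled down to 10M elements so it runs quickly):

```python
import time

import numpy

N = 10**7  # scaled down from 100M so the example runs quickly

t = time.time()
a = numpy.array(range(N))  # element-by-element conversion of a Python iterable
print('numpy.array(range(N)): %.3fs' % (time.time() - t))

t = time.time()
b = numpy.arange(N)        # filled directly in C, no Python objects involved
print('numpy.arange(N):       %.3fs' % (time.time() - t))

assert (a == b).all()      # same contents, very different construction cost
```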

I installed numpy using 'pip3 install numpy' on Ubuntu, and it compiled for quite some time, using both gcc and gfortran; moreover, I saw mentions of BLAS header files passing through the compiler output.

For what reason is numpy.dot so extremely slow?

Herbert
  • Don't "handicap" your C++, use instead numpy all the way in the Python side. Did you try numpy.arange or numpy.linspace? – dsign Jun 21 '15 at 14:43
  • Of course, when I'm not using numpy adequately, please let me know. – Herbert Jun 21 '15 at 14:44
  • Your python version is measuring the time to create two huge lists, plus the time to create two arrays from those lists, plus the dot product. The two codes don't do the same thing – talonmies Jun 21 '15 at 14:44
  • I'm not asking numpy to create the lists actually; if it does, imho, that's numpy's decision. Moreover, when I create one std::vector in C++, it takes approximately 4 seconds, like stated. – Herbert Jun 21 '15 at 14:46
  • @Herbert: `range` is a python builtin which has nothing to do with numpy at all. Please learn a little about the language before you start critiquing its performance – talonmies Jun 21 '15 at 14:50
  • @Herbert: just because you can't find the linkage to blas doesn't mean it's not there -- first of all, it's pretty likely it's "indirectly" used through LAPACK, and secondly, the libraries used to build numpy depend on their availability at build time. My numpy clearly links against BLAS and LAPACK, and had I chosen to use multithreaded implementations of these, the operations would be multithreaded. – Marcus Müller Jun 21 '15 at 14:57
  • @talonmies: I was actually asking a question, which you should not interpret as critique. Moreover, part of the question is why numpy does not use all 8 threads. – Herbert Jun 21 '15 at 18:43
  • @talonmies: I also noticed that numpy.array(numpy.arange(100000000)) (0.24 secs) is a lot faster than numpy.array(range(100000000)) (~15 secs). I guess Python data structures and numpy/scipy structures should be carefully applied at the same time. I'm used to Matlab and mex, where (imho) writing a Matlab function in C++ is less complicated. – Herbert Jun 21 '15 at 18:52
  • As an aside, this kind of benchmarks in C++ can be hard to get right. [Clang is "20 times faster" than GCC at `-O2`](http://coliru.stacked-crooked.com/a/9324608c0e6227cf) on this code because it computed the result of the loop at compile time. – T.C. Jun 22 '15 at 01:13
  • Let me be clear about this: the reason I created the c++11 example, was to get an idea of how fast this could be calculated, not to state that python is worse. The problem I have is that it takes a long time to do the numpy.dot-calculation, even though it can use 8 cores (but just uses one). The real world problem I'm trying to solve, is to multiply a 4M x 40k matrix with a 40k x 60k one, both sparse matrices, and apply a row-wise max() afterwards, leaving me with 4M numbers. This took me over 2 hours in a naive implementation. – Herbert Jun 22 '15 at 07:42
  • @Herbert: If you want to write C or C++ using Python, you should be using Cython. It allows you to interface C and C++ code with Python easily, writing the C or C++ code in a Python-like syntax. If you are looking for something closer to mex, you want to use Cython. – TheBlackCat Jun 22 '15 at 12:44
  • related: [Benchmarking (python vs. c++ using BLAS) and (numpy)](http://stackoverflow.com/q/7596612/4279) – jfs Jun 22 '15 at 13:05

2 Answers

12

So your comparison is unfair. In your python example, you first generate two range objects, convert them to numpy arrays and then do the scalar product. The calculation itself takes only the smallest part of the time. Here are the numbers for my computer:

>>> t=time.time();x=numpy.arange(100000000);numpy.dot(x,x);print time.time()-t
1.28280997276

And without the generation of the array:

>>> t=time.time();numpy.dot(x,x);print time.time()-t
0.124325990677
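As a side note, one-shot `time.time()` measurements like these are noisy; `timeit` repeats the call and takes the best run. A sketch of the same measurement (with a smaller array so it runs quickly):

```python
import timeit

import numpy

x = numpy.arange(10**7)  # scaled down from the question's 100M

# time only the dot product; repeat and take the best run to reduce noise
best = min(timeit.repeat(lambda: numpy.dot(x, x), number=10, repeat=3)) / 10
print('dot product: %.6fs per call' % best)
```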

For completeness, the C version takes roughly the same time:

real    0m0.108s
user    0m0.100s
sys 0m0.007s
Daniel
  • This is significantly better, but it's still using just one of my 8 threads. – Herbert Jun 21 '15 at 14:50
  • @Herbert: In which way is your C++ program using multiple threads? – Marcus Müller Jun 21 '15 at 14:51
  • It isn't, which I also never stated. – Herbert Jun 21 '15 at 14:52
  • @Herbert, python uses global interpreter lock (GIL), so it will only use one processor. – Akavall Jun 21 '15 at 14:53
  • @Akavall While typically true, when you use some of these libraries which implement things on the native side, they can use parallel algorithms for the native implementation. I think the reason `dot` might not be multithreaded is just that it's very unusual to try to take the dot product of a 100 million-component vector. Far more common is taking the dot product of a million small vectors in parallel, e.g. A lot of numpy's low-level core routines aren't multithreaded, only the higher-level ones. –  Jun 21 '15 at 14:55
  • It is my understanding that python supports calling C(++(11)) code, which is not limited by the GIL. Moreover, I'm assuming numpy's linear algebra operations are not implemented in python, but in C or FORTRAN or whichever suits best, which then allows the implementer to get around the GIL and use multiple calculation units of the CPU or GPU. – Herbert Jun 21 '15 at 14:56
  • @Herbert: exactly, you didn't state your C++ code was multithreaded. Neither did whoever compiled the numpy you're using state that he's using a multithread libBLAS implementation. – Marcus Müller Jun 21 '15 at 14:59
  • Good. This sounds promising and I will look into this, but it may take some time to figure this out. – Herbert Jun 21 '15 at 14:59
  • @Herbert: Is http://wiki.scipy.org/ParallelProgramming the source you were referring to? – Marcus Müller Jun 21 '15 at 15:00
  • @MarcusMüllerꕺꕺ, very useful link, apparently I was wrong in my above comment. From the link: "while numpy is doing an array operation, python also releases the GIL". – Akavall Jun 21 '15 at 15:11
  • Also note that any discussion of BLAS is irrelevant because the usage case in the question is an integer dot product, and that doesn't exist in the standard BLAS. I'm not aware of any level 1 BLAS dot implementations using multiple threads because there is normally no advantage in doing so on what is normally a memory bandwidth limited calculation. SIMD instructions are helpful, threads are not. – talonmies Jun 21 '15 at 17:23
1

range generates a list based on your given parameters, whereas your for loop in C++ merely increments a number.

I agree that it seems fairly costly, computationally, to spend so much time on generating one list. Then again, it is a big list, and you're requesting two of them ;-)

EDIT: As mentioned in comments range generates lists, not arrays.

Try substituting the range call with an incrementing while loop or similar and see if you get more tolerable results.

JonDeen