I have read many times about vectorized code in numpy. I know for a fact that a Python for loop can be ~100x slower than an equivalent numpy operation. However, I thought that the power of numpy vectorization was beyond a mere translation of a for loop from Python to C/Fortran code.
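For reference, this is the kind of gap I am talking about (just an illustration, not the benchmark below; the exact ratio obviously depends on the machine):

import numpy as np
from time import time

n = 1_000_000
a, b = np.random.random(n), np.random.random(n)

# Pure Python loop over the elements
t = time()
s_loop = sum(a[i] * b[i] for i in range(n))
print("python loop:", time() - t)

# Vectorized numpy dot product
t = time()
s_np = a @ b
print("numpy dot:  ", time() - t)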
I have read here and there about SIMD (Single Instruction, Multiple Data), BLAS (Basic Linear Algebra Subprograms) and other low-level stuff, without a clear understanding of what is going on under the hood. What I thought was that those libraries, thanks to parallelization at the CPU level, were able to perform the operations so that they scale in a sublinear fashion.
Maybe an example will help clarify this. Let's say we wish to compute the product of two matrices, and I want to check how increasing the number of rows of the first matrix affects the elapsed time (this has a lot to do with machine learning, if we think of the number of rows as the batch size, the left matrix as the data and the right matrix as the parameters of the model). My naïve understanding was that, in this case, the total elapsed time would scale (up to some limit) sublinearly, so that, in principle, and as long as everything fits into RAM, I expected that increasing the batch size would always be a good idea.
I've made some benchmarks and the situation is not what I expected. The growth looks linear, and given that the number of operations is a linear function of the number of rows (an (n, k) times (k, k) product takes roughly 2·n·k² floating-point operations), it looks like the C/Fortran code that runs under the hood is merely doing for loops.
This is the code:
import numpy as np
import matplotlib.pyplot as plt
from time import time

k = 128
N = 100  # repetitions per size
time_info = {}

for size in range(100, 5000, 200):
    A, B = np.random.random((size, k)), np.random.random((k, k))
    t = time()
    for i in range(N):
        np.dot(A, B)
    delta = time() - t
    time_info[size] = delta / N  # average time per product

x = np.array(list(time_info.keys()))
y = np.array(list(time_info.values())) * 1000

# Plotting the graph
plt.plot(x, y)
plt.title("Elapsed time vs number of rows")
plt.xlabel("Number of rows of left matrix")
plt.ylabel("Time in milliseconds")
plt.show()
It looks like the trend is quite linear. By the way, I've checked np.show_config() and it shows that I have openblas.
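In case it is relevant, a quick way to double-check the backend and the number of threads it uses is something like the following (threadpoolctl is a separate package, so this part is optional):

import numpy as np

np.show_config()  # which BLAS/LAPACK implementation numpy was built against

# Optional: report how many threads the BLAS thread pool is actually using
try:
    from threadpoolctl import threadpool_info
    for pool in threadpool_info():
        print(pool["internal_api"], "threads:", pool["num_threads"])
except ImportError:
    pass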
So my questions are:
- What exactly is the meaning of vectorization?
- Is the benchmark reasonable, and is it what you would have expected?
- If so, is there any noticeable optimization thanks to low-level stuff like SIMD that should have an impact on vectorized operations? Or does it perhaps only have an impact when you go from very small vectors/matrices to medium-size vectors/matrices?
This last option would make sense if the CPU is not fully occupied only when the sizes are very small. For example, and correct me if I am saying something stupid, if we had an architecture capable of performing 8 mathematical operations in parallel, then you would expect that multiplying a (1,) vector by a (1,) vector would be as fast as multiplying an (8,) vector by a (1,) vector. So, for very small sizes, the gain will be huge (an 8x gain). But if you have vectors of thousands of elements, then this effect will be negligible and the time will scale linearly, because the CPU will always be fully occupied. Does this naïve explanation make sense?
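If this explanation holds, I would expect a micro-benchmark along the following lines (sizes chosen arbitrarily) to show the per-element cost dropping sharply for small sizes and then flattening out once the CPU is saturated:

import numpy as np
from timeit import timeit

# Per-call and per-element cost of a dot product at increasing sizes
for n in (1, 8, 64, 512, 4096, 32768):
    v = np.random.random(n)
    t = timeit(lambda: np.dot(v, v), number=10_000) / 10_000
    print(f"n={n:6d}  per call: {t*1e6:8.2f} us  per element: {t/n*1e9:8.2f} ns")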