I want to verify that I've understood the concept of vectorized code that is mentioned in many Machine Learning lectures/notes/videos.
I did some reading on this and found that CPUs and GPUs support SIMD (single instruction, multiple data) instructions. These work, for example, by loading several values into special wide (64/128-bit) registers and then performing the same operation, such as an addition, on all of the packed elements at once.
I've also read that most modern compilers, GCC for example, can auto-vectorize loops written in C/C++ into SIMD instructions when optimization is turned on, e.g. with the -Ofast flag, which the GCC manual describes as:

-Ofast - Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

(The auto-vectorizer itself is -ftree-vectorize, which is part of -O3 and therefore also of -Ofast.)
I tested this out on my own code and got a significant speedup: training on the MNIST dataset went from 45 minutes down to 5 minutes.
I am also aware that NumPy's core is written in C and exposed to Python through the C API (as PyObject wrappers). I read through a lot of their source, but it is difficult to follow.
My question, then: is my understanding above correct, and does NumPy rely on the same compiler auto-vectorization, or does it use explicit pragmas, intrinsics, or other special instruction/register names for its vectorization?