I want to verify that I've understood the concept of vectorized code that is mentioned in many Machine Learning lectures/notes/videos.
I did some reading on this and found that CPUs and GPUs support SIMD (single instruction, multiple data) instructions. These work, for example, by loading several values into special wide (64/128-bit) registers and then performing the same operation, such as an addition, on all of the packed elements at once.
I've also read that most modern compilers, GCC for example, can auto-vectorize loops written in C/C++ into SIMD instructions when optimization is turned on, e.g. with the -Ofast flag, which the GCC manual describes as:

-Ofast - Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.

(The auto-vectorizer itself is -ftree-vectorize, which is part of -O3 and therefore also of -Ofast.)
I tested this out on my own code and got a significant speedup: training on the MNIST dataset went from 45 minutes down to 5 minutes.
I am also aware that NumPy's core is written in C and exposed to Python through the C API (as PyObject wrappers). I read through a lot of their source, but it is difficult to follow.
My question, then: is my understanding above correct, and does NumPy rely on the same compiler auto-vectorization, or does it use explicit pragmas, intrinsics, or other special instruction/register names for its vectorization?