2

I don't know that much about vectorization, but I am interested in understanding why a language like Python cannot provide vectorization on iterables through a library interface, much as it provides threading support. I am aware that many numpy methods are vectorized, but it can be limiting to have to work with numpy for generic computations.

My current understanding is that Python is not capable of vectorizing functions even if they match the "SIMD" pattern. For instance, shouldn't any list comprehension or use of the map() function be vectorizable in theory, because they output a list which is the result of running the same function on independent inputs from an input list?

With my naive understanding, it seems that anytime I use map(), I should in theory be able to create an instruction set that represents the function; then each element in the input just needs to be run through that same compiled function. What is the technical challenge to designing a tool that provides simd_map(func, iterable), which attempts to compile func "just in time", then grabs batches of input from iterable and uses the processor's SIMD capabilities to run those batches through func()?
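To pin down what I mean, here is a pure-Python reference for the hypothetical simd_map() (simd_map and batch_size are illustrative names, not a real API; this version gains no SIMD speed, it only fixes the semantics a real implementation would have to match):

```python
from itertools import islice

def simd_map(func, iterable, batch_size=8):
    """Semantically equivalent to map(func, iterable), but pulls input
    in fixed-size batches, the way a SIMD backend would."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # A real implementation would JIT-compile func once and push the
        # whole batch through vector instructions here, instead of looping.
        for x in batch:
            yield func(x)

print(list(simd_map(lambda x: x * x, range(6))))  # same output as map()
```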

Thanks!

SciGuy
  • Ok, and the fact that lists can hold objects of any type, and they can be mixed collections - does that factor in? The flexibility of lists comes with a price - Python has to check each object in turn. If you have an array, it's of a fixed `dtype` so no checks are necessary (unless it's of type `object`) – roganjosh Sep 20 '19 at 21:45
  • Base Python provides the flexibility. You have `numpy` for vectorized calculations and `numba` for JIT. – roganjosh Sep 20 '19 at 21:48
  • It could, *if* you could guarantee that `func` was a pure function. – chepner Sep 20 '19 at 21:53
  • @roganjosh sure, but why isn't type management just another part of func()? I do get that Python code would still be slower than C, but I still don't see any fundamental issue with doing JIT compilation then SIMD processing. Numba seems to roughly provide this functionality, but it also seems like it would be a pain to debug Numba code, as there is a restricted set of allowed calls and objects you can use in func() for numba. The example they give is that you cannot use a pandas DF built-in when making a vectorizable func – SciGuy Sep 20 '19 at 22:40
  • If you want a language designed to be similarly expressive as Python but with a focus on runtime performance, might I recommend [Julia](https://julialang.org/benchmarks/)? Python wasn't *designed* (and CPython wasn't implemented) for numeric computing -- it got shoehorned into the role later, after being adopted for reasons that had nothing to do with any kind of design focus on runtime performance -- and it's only suitable for the purpose when 3rd-party addons are tacked on top... which means you need to actually *use* those 3rd-party addons in your code to get their benefit. – Charles Duffy Sep 22 '19 at 21:24

2 Answers

6

The operation applied by map is arbitrary Python code. CPython is an interpreter, not a JIT compiler.

CPython could maybe have some canned C functions (ahead-of-time compiled as part of the interpreter) for SIMD operations over arrays, but it doesn't AFAIK. Even so, it would have to optimize the supplied `func` down to something it could do pattern-recognition on to notice that it was e.g. doing `a[i] = max(a[i], some_value)`.

But normally CPython doesn't do that; interpreter overhead is a huge problem when looping over elements of an array. CPython is nowhere near the performance of a native scalar loop, so there's huge room for gains even without auto-vectorization. Like factors of 200 slower, IIRC. e.g. Why are bitwise operators slower than multiplication/division/modulo? shows that some operations don't even have a "fast path" for small integers, and that overhead is enough to make `&` slower than `//`, which internally uses a hardware division instruction.
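The gap is easy to measure yourself. A small sketch (absolute timings vary by machine; the point is the ratio between an interpreted per-element loop and numpy's precompiled C loop over the same data):

```python
import timeit
import numpy as np

n = 100_000
data = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Same elementwise operation both ways: mask off the low byte.
t_interp = timeit.timeit(lambda: [x & 0xFF for x in data], number=20)
t_numpy = timeit.timeit(lambda: arr & 0xFF, number=20)

# Both produce identical values; only the per-element dispatch cost differs.
print(f"interpreted: {t_interp:.4f}s   numpy: {t_numpy:.4f}s")
```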

Also, Python lists aren't stored as simple contiguous arrays of `int32_t` or `double`, so CPU SIMD isn't efficient anyway. That's why numpy arrays are special: they do store values like a C array of primitive types.
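The `max(a[i], some_value)` pattern mentioned above happens to be one of the loops numpy ships precompiled, as the `np.maximum` ufunc. A quick sketch contrasting it with the interpreted version:

```python
import numpy as np

a = np.array([3, 9, 1, 7], dtype=np.int32)  # contiguous C array of int32
some_value = 5

# Interpreter-level loop: one bytecode dispatch per element.
looped = [max(x, some_value) for x in a.tolist()]

# Precompiled ufunc: one dispatch for the whole (SIMD-friendly) array.
vectorized = np.maximum(a, some_value)

print(looped)      # [5, 9, 5, 7]
print(vectorized)  # [5 9 5 7]
```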


(Caveat: I barely know Python and don't use it regularly. But I think I know enough for this answer to be correct: It's an interpreter written in C that doesn't do any on-the-fly generation of native machine code. The only native loops that can run are ones pre-compiled as part of the interpreter, or in NumPy libraries.)

Peter Cordes
0

If you want vectorization or JIT compilation, use numba, pypy, or cython, but be warned: the speed comes at the cost of flexibility.

numba is a Python module that will JIT-compile certain functions for you, but it does not support many kinds of input and barfs on some (many) Python constructs. It is really fast when it works, but can be difficult to wrangle. It is also very targeted at working with numpy arrays.
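A minimal sketch of the numba workflow (hedged: `clamp_sum` is just an example function, and the `try/except` falls back to plain Python so the snippet still runs where numba isn't installed):

```python
import numpy as np

try:
    from numba import njit  # JIT-compiles the decorated function to machine code
except ImportError:
    def njit(func):  # fallback: run the same code interpreted
        return func

@njit
def clamp_sum(xs, lo, hi):
    # A plain numeric loop over a numpy array: exactly the kind of
    # code numba compiles well (and may auto-vectorize).
    total = 0.0
    for x in xs:
        if x < lo:
            x = lo
        elif x > hi:
            x = hi
        total += x
    return total

print(clamp_sum(np.array([-2.0, 0.5, 9.0]), 0.0, 1.0))  # 0.0 + 0.5 + 1.0 = 1.5
```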

pypy is a complete replacement for the CPython interpreter that includes a JIT compiler. It supports the entire Python spec, but does not integrate well with C extensions, so some libraries will not work.

cython is an extension of Python which compiles to a binary that behaves like a Python module. However, it requires you to use special syntax to take advantage of the speed gains, and you have to explicitly declare C types to really get any advantage.

My recommendation: use pypy if you are pure Python (if it works for you, it's basically effortless). Use numba if you need to speed up numeric calculations that numpy doesn't have a good way to do. Use cython if you need the speed and the other two don't work.

user1424589
  • "Why can't python vectorize map() or list comprehensions". How does this answer that at all? – roganjosh Sep 20 '19 at 23:07
  • @roganjosh: this answer is basically saying "it maybe can when using implementations of the language other than CPython". The question asks about Python (the language), and only implicitly about CPython (the default/standard implementation). Upvoted as a useful addition to my answer about CPython, especially for anyone that's interested in practical results. – Peter Cordes Sep 21 '19 at 02:46
  • @PeterCordes, ...speaking as someone who *does* know Python fairly well, I don't find this to be a useful answer. Yes, it speaks to plugins or alternative runtime implementations that allow Python better runtime performance, but it doesn't speak directly to the ability of *any* of those to vectorize list comprehensions or `map()` function output, which is the specific question asked. – Charles Duffy Sep 23 '19 at 13:26
  • @CharlesDuffy: good point. PyPy depends on JIT compiling, and most JIT compilers don't have time to look for auto-vectorization with SIMD. e.g. Java's HotSpot JIT took years to get any SIMD, and then only for simple loops. [Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?](//stackoverflow.com/a/20267515) Ahead-of-time compilers can spend more time optimizing. – Peter Cordes Sep 23 '19 at 18:59