
I have a linear array arr which is the concatenation of a number of smaller sub-arrays. The position of each sub-array within the large array is given by an offsets array.

For example:

If arr = [1,2,4,3,6,4,8,9,12,3] and offsets = [0,3,5,10], then the sub-arrays are [1,2,4], [3,6] and [4,8,9,12,3].

Now, given arr and offsets, I would like to perform a parallel reduction on each sub-array. Let's assume we want to find the max() of each sub-array, so the result in our example would be [4,6,12]. The important constraint in this case is that the operations must be strictly vectorized; no explicit iteration in loops is allowed. For instance, in Python, something like this isn't allowed:

import numpy
arr = numpy.array([1, 2, 4, 3, 6, 4, 8, 9, 12, 3])
offsets = numpy.array([0, 3, 5, 10])    # boundaries of the sub-arrays
num_sub_arrays = len(offsets) - 1       # one sub-array per pair of boundaries
max_array = numpy.empty(num_sub_arrays)

for i in range(num_sub_arrays):     # Eliminate this loop!
    max_array[i] = numpy.max(arr[offsets[i]:offsets[i+1]])
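
For reference, the ufunc.reduceat suggestion from the comments below would look like this in NumPy; I include it only to make the target behaviour concrete, since it doesn't generalize to other hardware:

import numpy
arr = numpy.array([1, 2, 4, 3, 6, 4, 8, 9, 12, 3])
offsets = numpy.array([0, 3, 5, 10])

# reduceat reduces arr[offsets[i]:offsets[i+1]] for each consecutive
# pair of start indices; the final boundary is dropped because it only
# marks the end of the last sub-array.
max_array = numpy.maximum.reduceat(arr, offsets[:-1])
# max_array -> [ 4  6 12]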

How can I go about achieving this? Also, please note that the code should be as general as possible, as I plan on implementing it across multiple hardware targets (CUDA, AVX, etc.).

Any implementable hints will be highly appreciated!

  • Are you looking to get a `max` reduction only? Look into `ufunc.reduceat`. – Divakar Jul 03 '18 at 11:03
  • The reduction can be any defined function (prefix sum, for example). But for this question, let's go with `max()`. @Divakar Thanks! Will do. – Roy_123 Jul 03 '18 at 11:05
  • Why the downvote? At least give a comment or an answer before downvoting. – Roy_123 Jul 03 '18 at 11:49
  • I removed the CUDA tag because this question clearly isn't an on-topic CUDA programming question. Please don't tag spam. – talonmies Jul 03 '18 at 14:23
  • I think it's unreasonable to ask for general code that is readily adaptable to different architectures. At a high level, you're asking for a segmented reduction. I wouldn't try to drive the generalization down to a lower level than that. In CUDA, this is a solved problem, and you should not be writing your own "general" code for it without a very good reason. An explicit [reduction in AVX](https://software.intel.com/en-us/blogs/2016/04/12/can-you-write-a-vectorized-reduction-operation), on the other hand, is going to look quite different (more like a warp-shuffle reduction in CUDA). – Robert Crovella Jul 03 '18 at 14:28
  • Stated another way, I think you're likely to end up with better results by saying "I want to do a segmented parallel reduction" and then determining individually, for each architecture you care about, what the best way to do that is. They may look very different, and trying to drive a general algorithm down to a lower level, before architectural specialization, may give you significantly less performant code. – Robert Crovella Jul 03 '18 at 14:29
  • Segmented parallel reductions in CUDA can be readily performed using existing open-source libraries like [thrust](https://thrust.github.io/doc/group__reductions.html) (`reduce_by_key`) and [CUB](https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html) (`DeviceSegmentedReduce`), and I can't imagine why you wouldn't use those for the CUDA implementation. – Robert Crovella Jul 03 '18 at 14:32
  • @RobertCrovella Thrust can be called from host code only. I need to perform operations on the device itself. I haven't used CUB before; I will look into it. – Roy_123 Jul 04 '18 at 04:24
  • @talonmies Sure. Thanks for pointing that out. – Roy_123 Jul 04 '18 at 04:25
  • thrust algorithms can be called from device code – Robert Crovella Jul 04 '18 at 14:30
  • @RobertCrovella I believe that's supported starting from Thrust 1.8, and CUDA includes that in its standard installation from CUDA 9.2. The end user may not have such an updated version of CUDA. I ended up coding it based on the parallel scan given in GPU Gems. Thanks for giving the correct name (segmented scan) for the problem. – Roy_123 Jul 04 '18 at 15:38
  • Thrust 1.8 has been available in CUDA since CUDA 7, three years ago. See [here](https://stackoverflow.com/questions/28150098/how-to-use-thrust-to-sort-the-rows-of-a-matrix/28254765#28254765). – Robert Crovella Jul 04 '18 at 17:30
  • @RobertCrovella The official [docs](https://docs.nvidia.com/cuda/archive/9.1/thrust/index.html#installation-and-versioning) for v9.1 disagree. Maybe they have not been updated? – Roy_123 Jul 05 '18 at 04:36
  • It is quite a simple problem if you are allowed to use a compiler like Numba. This may also be the fastest version (every vectorized command is itself a loop, usually written in Cython, C or Fortran). – max9111 Jul 05 '18 at 12:16
  • @max9111 True. The loop as I have given it is tailor-made for Numba. I was, however, hoping to see a better parallel algorithm for the solution (see the sketch after these comments). – Roy_123 Jul 05 '18 at 15:14
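
For completeness, here is a rough NumPy sketch of the kind of GPU-Gems-style segmented max-scan discussed in the comments above. It is an illustration, not the exact code referred to there: segmented_max is a hypothetical name, and offsets is assumed to hold the sub-array boundaries as in the question. The while-loop runs over O(log n) scan levels, and every statement inside it is a whole-array vectorized operation, which is the part that maps onto CUDA/AVX-style hardware:

import numpy

def segmented_max(arr, offsets):
    # Hillis-Steele style segmented inclusive max-scan: the while-loop
    # covers O(log n) levels; each level is fully vectorized.
    x = arr.copy()
    n = len(x)
    flags = numpy.zeros(n, dtype=bool)
    flags[offsets[:-1]] = True               # True where a segment starts
    d = 1
    while d < n:
        # combine with the element d slots back, unless boundary
        # information has already reached this position
        x[d:] = numpy.where(flags[d:], x[d:], numpy.maximum(x[:-d], x[d:]))
        flags[d:] = flags[d:] | flags[:-d]   # propagate boundary flags
        d *= 2
    return x[offsets[1:] - 1]                # last element of each segment

arr = numpy.array([1, 2, 4, 3, 6, 4, 8, 9, 12, 3])
offsets = numpy.array([0, 3, 5, 10])
print(segmented_max(arr, offsets))           # [ 4  6 12]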

0 Answers