I have a flat array arr that is the concatenation of a number of smaller sub-arrays. The position of each sub-array within the large array is given by an offsets array.
For example, if arr = [1,2,4,3,6,4,8,9,12,3] and offsets = [0,3,5,10], then the sub-arrays are [1,2,4], [3,6] and [4,8,9,12,3]. (Each entry of offsets is the start index of a sub-array, with a final entry marking the end of arr.)
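As a quick sanity check of that convention, the decomposition can be written with numpy.split (a small sketch; the interior offsets are the split points):

import numpy

arr = numpy.array([1, 2, 4, 3, 6, 4, 8, 9, 12, 3])
offsets = numpy.array([0, 3, 5, 10])
# Split at the interior offsets; the leading 0 and trailing len(arr) are implied.
subarrays = numpy.split(arr, offsets[1:-1])
# subarrays == [array([1, 2, 4]), array([3, 6]), array([4, 8, 9, 12, 3])]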
Now, given arr and offsets, I would like to perform a parallel reduction on each sub-array. Let's assume that we want to find the max() of each sub-array, so the result in our example would be [4,6,12]. The important constraint here is that the operations should be strictly vectorized; no explicit iteration in loops is allowed. For instance, in Python, something like this isn't allowed:
import numpy

arr = numpy.array([1, 2, 4, 3, 6, 4, 8, 9, 12, 3])
offsets = numpy.array([0, 3, 5, 10])
num_sub_arrays = len(offsets) - 1  # one sub-array per consecutive pair of offsets

max_array = numpy.empty(num_sub_arrays)
for i in range(num_sub_arrays):  # Eliminate this loop!
    max_array[i] = numpy.max(arr[offsets[i]:offsets[i + 1]])
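For what it's worth, within NumPy itself I believe ufunc.reduceat performs exactly this kind of segmented reduction in a single vectorized call (a sketch, reusing the arrays above; reduceat takes the segment start indices, so the trailing end marker is dropped):

# reduceat reduces arr[starts[i]:starts[i+1]] for each consecutive pair of
# start indices; the final segment runs to the end of arr.
max_array = numpy.maximum.reduceat(arr, offsets[:-1])
# max_array == array([ 4,  6, 12])

That covers the NumPy case, but it is library-specific; if I recall correctly, CUDA-side libraries expose the same operation as a segmented reduce (e.g. CUB's DeviceSegmentedReduce or Thrust's reduce_by_key), so what I'm really after is the underlying portable pattern.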
How can I go about achieving this? Also, please note that the code should be as general as possible, as I plan on implementing it across multiple hardware targets (CUDA, AVX, etc.).
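One hardware-portable formulation I've been considering is the segment-ID scatter pattern: expand offsets into a per-element segment ID, then scatter every element into its segment's result slot with an unbuffered/atomic max (on CUDA this would be one atomicMax per element). A NumPy sketch of the idea, continuing from the snippet above (the -inf initialisation assumes a floating-point result is acceptable):

num_sub_arrays = len(offsets) - 1

# Segment ID of every element: [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
seg_ids = numpy.repeat(numpy.arange(num_sub_arrays), numpy.diff(offsets))

# Scatter-max each element into its segment's slot (unbuffered, like atomicMax).
max_array = numpy.full(num_sub_arrays, -numpy.inf)
numpy.maximum.at(max_array, seg_ids, arr)
# max_array == array([ 4.,  6., 12.])

Both the seg_ids construction and the scatter are element-wise parallel, which is why I suspect this maps onto CUDA directly and, with a segmented scan as the reduction primitive, onto AVX as well.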
Any implementable hints would be very helpful!