Based on your comment on your question, you were trying to find the max and min values while calculating them. While that's possible, it's not the most efficient approach. If you're set on doing it that way, you can have each thread do an atomic comparison against a global minimum and a global maximum, with the downside that the threads effectively serialize on those atomics, which will likely be a significant bottleneck.
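For illustration, here's a minimal sketch of that atomic approach (assuming int data, since atomicMin/atomicMax have native integer overloads; float would need a compare-and-swap workaround, and d_min/d_max would need to be initialized to INT_MAX/INT_MIN before launch):

__global__ void cu_atomic_min_max(const int* d_data, const int d_len,
                                  int* d_min, int* d_max)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < d_len)
    {
        // every thread contends on the same two global addresses,
        // which is exactly the serialization bottleneck mentioned above
        atomicMin(d_min, d_data[idx]);
        atomicMax(d_max, d_data[idx]);
    }
}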
For the more canonical approach to finding the maximum or minimum in an array via a reduction, you can do something along the lines of:
#define MAX_NEG ... //some small number, lower than anything in your data
template <typename T, int BLKSZ> __global__
void cu_max_reduce(const T* d_data, const int d_len, T* max_val)
{
    volatile __shared__ T smem[BLKSZ];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    //starting index for this block's window of the input data
    const int bid_sidx = bid * BLKSZ;

    //load the input data to smem, padding with MAX_NEG if needed;
    //each thread handles 2 elements
    #pragma unroll
    for (int i = 0; i < 2; i++)
    {
        //get the index for the thread to load into shared memory
        const int tid_idx = 2 * tid + i;
        const int ld_idx = bid_sidx + tid_idx;

        if (ld_idx < (bid + 1) * BLKSZ && ld_idx < d_len)
            smem[tid_idx] = d_data[ld_idx];
        else
            smem[tid_idx] = MAX_NEG;
    }
    //one barrier after the loads is enough; syncing inside the loop
    //would be redundant
    __syncthreads();

    //run the tree reduction per-block
    for (unsigned int stride = BLKSZ / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            smem[tid] = (smem[tid] > smem[tid + stride]) ? smem[tid] : smem[tid + stride];
        __syncthreads();
    }

    //write the per-block result out from shared memory to global memory;
    //only one thread needs to do this
    if (tid == 0)
        max_val[bid] = smem[0];
}
//assume d_data is a device pointer to our data, of length data_len
template <typename T> __host__
T cu_find_max(const T* d_data, const int data_len)
{
    //in your host code, invoke the kernel with something along the lines of:
    const int thread_per_block = 16;
    const int elem_per_thread = 2;
    //number of elements to process per block
    const int BLKSZ = elem_per_thread * thread_per_block;
    //integer round-up avoids the float precision issues of ceil()
    const int blocks_per_grid = (data_len + BLKSZ - 1) / BLKSZ;

    dim3 block_dim(thread_per_block, 1, 1);
    dim3 grid_dim(blocks_per_grid, 1, 1);

    T* d_max;
    cudaMalloc((void**)&d_max, sizeof(T) * blocks_per_grid);
    cu_max_reduce<T, BLKSZ><<<grid_dim, block_dim>>>(d_data, data_len, d_max);
    //etc....
}
This will find the per-block maximum value. You can then run it again on its output (e.g. with d_max as the input data and updated launch parameters) on 1 block to find the global maximum. Running it in a multi-pass manner like this is necessary whenever your dataset is larger than a single block can reduce; with each thread processing 2 elements and the usual 1024-thread block limit, that's 2 * 1024 elements per block, although you could process more elements per thread to raise that.
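For instance, a sketch of that second pass and the final copy back (the BLKSZ2 and d_final names here are just illustrative, not part of the code above):

//reduce the blocks_per_grid partial maxima down to a single value;
//this assumes blocks_per_grid <= BLKSZ2, otherwise keep iterating
const int BLKSZ2 = 32; //compile-time constant covering blocks_per_grid
T* d_final;
cudaMalloc((void**)&d_final, sizeof(T));
cu_max_reduce<T, BLKSZ2><<<1, BLKSZ2 / 2>>>(d_max, blocks_per_grid, d_final);

//copy the global maximum back to the host
T h_max;
cudaMemcpy(&h_max, d_final, sizeof(T), cudaMemcpyDeviceToHost);
cudaFree(d_final);
cudaFree(d_max);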
I should point out that this isn't particularly efficient (you'd want to use a more intelligent stride when loading the shared memory to avoid bank conflicts), and I'm not 100% sure it's correct (it worked on the few small testcases I tried), but I tried to write it for maximal clarity. Also, don't forget to put in some error-checking code to make sure your CUDA calls complete successfully; I left it out here to keep things short(er).
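A common pattern for that error checking is a wrapper macro along these lines (the CUDA_CHECK name is just illustrative); kernel launches themselves can then be checked with cudaGetLastError():

#include <cstdio>
#include <cstdlib>

//wrap each CUDA runtime call, e.g. CUDA_CHECK(cudaMalloc(...));
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)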
I should also direct you towards some more in-depth documentation: take a look at the CUDA reduction sample at http://docs.nvidia.com/cuda/cuda-samples/index.html. Although it isn't doing a min/max calculation, it's the same general idea (and more efficient). Also, if you're looking for simplicity, you might just want to use Thrust's thrust::max_element and thrust::min_element; see the documentation at thrust.github.com/doc/group__extrema.html.
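For instance, a quick sketch of the Thrust route (assuming d_data is a raw device pointer to int of length data_len):

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

thrust::device_ptr<const int> dptr = thrust::device_pointer_cast(d_data);
//max_element/min_element return device iterators; dereferencing one
//copies the pointed-to value back to the host
const int max_val = *thrust::max_element(dptr, dptr + data_len);
const int min_val = *thrust::min_element(dptr, dptr + data_len);

There's also thrust::minmax_element, which returns both extrema in a single pass.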