Based on your comment on your question, you were trying to find the max and min values while calculating them. While that's possible, it's not the most efficient approach. If you're set on doing it that way, you can have each thread do an atomic comparison against a global minimum and a global maximum, with the downside that the threads effectively serialize on those atomics, which will likely be a significant bottleneck.
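For illustration, here's a minimal sketch of that atomic approach (assuming int data, since atomicMin/atomicMax have native integer overloads; float would need a compare-and-swap workaround, and d_min/d_max would need to be initialized to INT_MAX/INT_MIN before launch):

__global__ void cu_atomic_min_max(const int* d_data, const int d_len,
                                  int* d_min, int* d_max)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < d_len)
    {
        // every thread contends on the same two global addresses,
        // which is exactly the serialization bottleneck mentioned above
        atomicMin(d_min, d_data[idx]);
        atomicMax(d_max, d_data[idx]);
    }
}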
For the more canonical approach to finding the maximum or minimum in an array via a reduction, you can do something along the lines of:
#define MAX_NEG ... //some small number, lower than anything in your data
template <typename T, int BLKSZ> __global__
void cu_max_reduce(const T* d_data, const int d_len, T* max_val)
{
    volatile __shared__ T smem[BLKSZ];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    //starting index for this block's window of the input data
    const int bid_sidx = bid * BLKSZ;

    //load the input data to smem, padding with MAX_NEG if needed;
    //each thread handles 2 elements
    #pragma unroll
    for (int i = 0; i < 2; i++)
    {
        //get the index for the thread to load into shared memory
        const int tid_idx = 2 * tid + i;
        const int ld_idx = bid_sidx + tid_idx;

        if (ld_idx < (bid + 1) * BLKSZ && ld_idx < d_len)
            smem[tid_idx] = d_data[ld_idx];
        else
            smem[tid_idx] = MAX_NEG;
    }
    //one barrier after the loads is enough; syncing inside the loop
    //would be redundant
    __syncthreads();

    //run the tree reduction per-block
    for (unsigned int stride = BLKSZ / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            smem[tid] = (smem[tid] > smem[tid + stride]) ? smem[tid] : smem[tid + stride];
        __syncthreads();
    }

    //write the per-block result out from shared memory to global memory;
    //only one thread needs to do this
    if (tid == 0)
        max_val[bid] = smem[0];
}
//assume d_data is a device pointer to our data, of length data_len
template <typename T> __host__
T cu_find_max(const T* d_data, const int data_len)
{
    //in your host code, invoke the kernel with something along the lines of:
    const int thread_per_block = 16;
    const int elem_per_thread = 2;
    //number of elements to process per block
    const int BLKSZ = elem_per_thread * thread_per_block;
    //integer round-up avoids the float precision issues of ceil()
    const int blocks_per_grid = (data_len + BLKSZ - 1) / BLKSZ;

    dim3 block_dim(thread_per_block, 1, 1);
    dim3 grid_dim(blocks_per_grid, 1, 1);

    T* d_max;
    cudaMalloc((void**)&d_max, sizeof(T) * blocks_per_grid);
    cu_max_reduce<T, BLKSZ><<<grid_dim, block_dim>>>(d_data, data_len, d_max);
    //etc....
}
This will find the per-block maximum value. You can then run it again on its output (e.g. with d_max as the input data and updated launch parameters) on 1 block to find the global maximum. Running it in a multi-pass manner like this is necessary whenever your dataset is larger than a single block can reduce; with each thread processing 2 elements and the usual 1024-thread block limit, that's 2 * 1024 elements per block, although you could process more elements per thread to raise that.
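For instance, a sketch of that second pass and the final copy back (the BLKSZ2 and d_final names here are just illustrative, not part of the code above):

//reduce the blocks_per_grid partial maxima down to a single value;
//this assumes blocks_per_grid <= BLKSZ2, otherwise keep iterating
const int BLKSZ2 = 32; //compile-time constant covering blocks_per_grid
T* d_final;
cudaMalloc((void**)&d_final, sizeof(T));
cu_max_reduce<T, BLKSZ2><<<1, BLKSZ2 / 2>>>(d_max, blocks_per_grid, d_final);

//copy the global maximum back to the host
T h_max;
cudaMemcpy(&h_max, d_final, sizeof(T), cudaMemcpyDeviceToHost);
cudaFree(d_final);
cudaFree(d_max);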
I should point out that this isn't particularly efficient (you'd want to use a more intelligent stride when loading the shared memory to avoid bank conflicts), and I'm not 100% sure it's correct (it worked on the few small testcases I tried), but I tried to write it for maximal clarity. Also, don't forget to put in some error-checking code to make sure your CUDA calls complete successfully; I left it out here to keep things short(er).
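A common pattern for that error checking is a wrapper macro along these lines (the CUDA_CHECK name is just illustrative); kernel launches themselves can then be checked with cudaGetLastError():

#include <cstdio>
#include <cstdlib>

//wrap each CUDA runtime call, e.g. CUDA_CHECK(cudaMalloc(...));
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)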
I should also direct you towards some more in-depth documentation: take a look at the CUDA reduction sample at http://docs.nvidia.com/cuda/cuda-samples/index.html. Although it isn't doing a min/max calculation, it's the same general idea (and more efficient). Also, if you're looking for simplicity, you might just want to use Thrust's thrust::max_element and thrust::min_element; see the documentation at thrust.github.com/doc/group__extrema.html.
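For instance, a quick sketch of the Thrust route (assuming d_data is a raw device pointer to int of length data_len):

#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

thrust::device_ptr<const int> dptr = thrust::device_pointer_cast(d_data);
//max_element/min_element return device iterators; dereferencing one
//copies the pointed-to value back to the host
const int max_val = *thrust::max_element(dptr, dptr + data_len);
const int min_val = *thrust::min_element(dptr, dptr + data_len);

There's also thrust::minmax_element, which returns both extrema in a single pass.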