Dear downvoters: kindly let me know the reason for the downvote. I have already accepted an answer, which means that the answerer was able to understand the problem and a minimal working example was not required. Also, I intended this to be a conceptual question rather than a homework problem.
IMPORTANT: I have already read several threads (for example this) about the distribution of threads and blocks, but I have a specific query.
I have to process image data in unsigned char form on the GPU. My image is of size (1080 x 1920) with 3 channels, and each pixel is of unsigned char type.
GPU Details:
NVIDIA Quadro K2000
2 GB of GDDR5 GPU memory
384 SMX CUDA parallel processing cores
As I am new to GPU processing, I do not understand how to choose the number of threads per block and the total number of blocks for my GPU in this specific case.
PROBLEM: When I use the following configuration to launch the GPU kernel for my (1080 x 1920) image, I get the desired results, but the computation time is too long:
dim3 numOfBlocks(108, 192);
dim3 numOfThreadsPerBlocks(3 * 10, 3 * 10); // multiplied by 3 because the image has 3 channels
colorTransformation_kernel<<<numOfBlocks, numOfThreadsPerBlocks>>>(numChannels, step_size, iw, ih, dev_ptr_source, dev_ptr_dst);
But if I choose the following configuration instead:
dim3 numOfBlocks(108 / 2, 192 / 2);
dim3 numOfThreadsPerBlocks(3 * 10 * 2, 3 * 10 * 2); // multiplied by 3 because the image has 3 channels
then I get a blank image.
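For reference, the per-block thread count of the two configurations can be worked out on the host without a GPU. The following is only a minimal sketch: it assumes the limit of 1024 threads per block that applies to compute capability 3.0 (Kepler) devices such as the K2000, and `Dim2`, `threadsPerBlock`, `blockFits`, `gridExtent`, and `report` are hypothetical helper names, not part of the CUDA API. In real CUDA code the limit would be read from `cudaDeviceProp::maxThreadsPerBlock` via `cudaGetDeviceProperties`.

```cpp
#include <cassert>
#include <cstdio>

// Minimal stand-in for CUDA's dim3 so the arithmetic runs on the host only.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Assumption: 1024 threads per block, the limit for compute capability 3.0.
// Real code should query cudaDeviceProp::maxThreadsPerBlock instead.
constexpr unsigned kMaxThreadsPerBlock = 1024;

// Number of threads in one 2D block.
constexpr unsigned threadsPerBlock(Dim2 block) {
    return block.x * block.y;
}

// True when the block shape stays within the per-block thread limit.
constexpr bool blockFits(Dim2 block, unsigned limit = kMaxThreadsPerBlock) {
    return threadsPerBlock(block) <= limit;
}

// Total number of threads launched along each axis of the grid.
constexpr Dim2 gridExtent(Dim2 grid, Dim2 block) {
    return Dim2{grid.x * block.x, grid.y * block.y};
}

// Prints a one-line summary of a launch configuration.
void report(const char* name, Dim2 grid, Dim2 block) {
    assert(block.x > 0 && block.y > 0);
    const Dim2 extent = gridExtent(grid, block);
    std::printf("%s: %u threads/block, extent %u x %u, %s\n",
                name, threadsPerBlock(block), extent.x, extent.y,
                blockFits(block) ? "within limit" : "exceeds limit");
}
```

Calling `report("first", Dim2{108, 192}, Dim2{30, 30})` and `report("second", Dim2{54, 96}, Dim2{60, 60})` from a `main` function shows that both configurations launch the same 3240 x 5760 threads in total, but the first uses 900 threads per block while the second uses 3600, which exceeds the assumed limit. A launch that exceeds the limit fails without running the kernel, which would leave the output buffer unwritten; checking `cudaGetLastError()` right after the launch would confirm whether this is what happens here.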