I have a problem concerning some kind of reduction in CUDA. distance is a matrix with gridSize*numberOfAngles elements, fftData is a matrix with numberOfAngles*NFFT elements, and grid_magnitude is the result matrix where I want to store my calculation; it has gridSize elements.

I want to calculate the index in fftData that corresponds to one specific value in distance. After that, the value at this index in fftData is added to grid_magnitude at the corresponding grid point.

This is my code so far:
__global__ void calcBackProjection(cuFloatComplex* fftData,
                                   cuFloatComplex* grid_magnitude,
                                   float* distance,
                                   int gridSize,
                                   int numberOfAngles,
                                   float rangeBin,
                                   int NFFT)
{
    int gridPointIdx = threadIdx.x + blockIdx.x * blockDim.x;
    while (gridPointIdx < gridSize)
    {
        for (int angleIdx = 0; angleIdx < numberOfAngles; angleIdx++)
        {
            // find rangeBin in fftData corresponding to distance
            float curDistance = distance[gridPointIdx + angleIdx * gridSize];
            int wantedIdx = floor(curDistance / rangeBin);
            if (wantedIdx < NFFT)
            {
                grid_magnitude[gridPointIdx + angleIdx * gridSize] =
                    addCmplx(fftData[wantedIdx + angleIdx * NFFT],
                             grid_magnitude[gridPointIdx + angleIdx * gridSize]);
            }
        }
        gridPointIdx += blockDim.x * gridDim.x;
    }
}
gridPointIdx should be unique for every thread, so each thread should be writing to a different location in grid_magnitude. But it seems that this is not working, because no change is made to grid_magnitude.

What am I missing?
I didn't manage to do this with fully parallel two-dimensional indexing. Maybe I should use shared memory, but how do I partition grid_magnitude so that the threads each work on their own part of it?
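One thing I could check is whether the launch itself fails silently; a minimal host-side sketch with error checking (pointer names and the block size are placeholders, not my actual host code):

```cuda
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <cstdio>

void launchBackProjection(cuFloatComplex *d_fftData,
                          cuFloatComplex *d_gridMagnitude,
                          float *d_distance,
                          int gridSize, int numberOfAngles,
                          float rangeBin, int NFFT)
{
    int threads = 256;
    int blocks = (gridSize + threads - 1) / threads;
    calcBackProjection<<<blocks, threads>>>(d_fftData, d_gridMagnitude,
                                            d_distance, gridSize,
                                            numberOfAngles, rangeBin, NFFT);
    // An invalid launch configuration surfaces here ...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    // ... while an error during execution (e.g. an out-of-bounds
    // access) is only reported after synchronizing.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}
```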