How to parallelize DCT (for loops) in CUDA

Question

How can I parallelize four nested for loops in CUDA?
My DCT code has four nested for loops, and I want to write the DCT function as CUDA code:

    for (y = 0; y < HEIGHT; y += BLOCK_H) {
        for (x = 0; x < WIDTH; x += BLOCK_W) {
            for (i = 0; i < BLOCK_H; i++) {
                for (j = 0; j < BLOCK_W; j++) {
                    block_in[i][j] = cur_frame[(x + j) + (WIDTH * (y + i))];
                }
            }
        }
    }

[Here's](http://stackoverflow.com/questions/37708101/parallelize-four-and-more-nested-loops-with-cuda) a question that asks how to parallelize nested loops. [Here's](http://stackoverflow.com/questions/34529387/kernel-for-processing-a-4d-tensor-in-cuda) another one. – Robert Crovella, Jun 11 '16 at 20:35

score 3 · Answer 1 · answered Jun 11 '16 at 05:40


There is an NVIDIA white paper by Obukhov and Kharlamov, "Discrete Cosine Transform for 8x8 Blocks with CUDA", that accompanies the dct8x8 sample in the CUDA samples. You should have a look at both.
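As a rough starting point, the four loops in the question can be mapped directly onto the CUDA grid: the two outer loops (`y`, `x`) become `blockIdx`, and the two inner loops (`i`, `j`) become `threadIdx`. The sketch below is a minimal illustration of that mapping, not the method from the white paper or the dct8x8 sample. It assumes `WIDTH` and `HEIGHT` are multiples of the block size, and the names `dct8x8_naive` and `run_constant_image_check` are made up for this example. Each thread loads one pixel into shared memory and then computes one coefficient with the plain (unoptimized) orthonormal 2D DCT-II formula.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_W 8
#define BLOCK_H 8

// One CUDA thread block per 8x8 image block, one thread per pixel/coefficient.
__global__ void dct8x8_naive(const float *cur_frame, float *out, int width)
{
    __shared__ float block_in[BLOCK_H][BLOCK_W];

    const int j = threadIdx.x;            // replaces the innermost j loop
    const int i = threadIdx.y;            // replaces the i loop
    const int x = blockIdx.x * BLOCK_W;   // replaces the x loop
    const int y = blockIdx.y * BLOCK_H;   // replaces the y loop

    // Same load as in the question, now done by 64 threads at once.
    block_in[i][j] = cur_frame[(x + j) + width * (y + i)];
    __syncthreads();                      // the whole tile must be loaded first

    // Thread (i, j) computes coefficient (v = i, u = j) of the naive 2D DCT-II.
    const float pi = 3.14159265358979f;
    float sum = 0.0f;
    for (int r = 0; r < BLOCK_H; ++r)
        for (int c = 0; c < BLOCK_W; ++c)
            sum += block_in[r][c]
                 * cosf((2 * r + 1) * i * pi / (2.0f * BLOCK_H))
                 * cosf((2 * c + 1) * j * pi / (2.0f * BLOCK_W));

    const float cv = (i == 0) ? sqrtf(1.0f / BLOCK_H) : sqrtf(2.0f / BLOCK_H);
    const float cu = (j == 0) ? sqrtf(1.0f / BLOCK_W) : sqrtf(2.0f / BLOCK_W);
    out[(x + j) + width * (y + i)] = cv * cu * sum;
}

// Hypothetical self-check: for a constant image of ones, every 8x8 block
// should have DC coefficient 8 and all other coefficients 0.
// Returns the largest absolute deviation from that expectation.
float run_constant_image_check()
{
    const int WIDTH = 32, HEIGHT = 16;    // assumed multiples of the block size
    const size_t n = (size_t)WIDTH * HEIGHT;
    std::vector<float> h_in(n, 1.0f), h_out(n, -1.0f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 threads(BLOCK_W, BLOCK_H);                  // i and j loops
    dim3 grid(WIDTH / BLOCK_W, HEIGHT / BLOCK_H);    // x and y loops
    dct8x8_naive<<<grid, threads>>>(d_in, d_out, WIDTH);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float max_err = 0.0f;
    for (int y = 0; y < HEIGHT; ++y)
        for (int x = 0; x < WIDTH; ++x) {
            const bool is_dc = (x % BLOCK_W == 0) && (y % BLOCK_H == 0);
            const float expected = is_dc ? 8.0f : 0.0f;
            max_err = std::fmax(max_err, std::fabs(h_out[x + WIDTH * y] - expected));
        }
    return max_err;
}

int main()
{
    printf("max deviation on constant image: %g\n", run_constant_image_check());
    return 0;
}
```

This version does 64 multiply-adds per coefficient, so it is mainly useful for seeing how the loop indices turn into thread and block indices. The sample and the white paper are the place to look for a faster formulation.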


merelyMark


I read that, but I didn't understand what this line does in the C version: `SubroutineDCTvector((float *)fSrc + (i+k) * Stride + j, 1, fDst + (i+k) * Stride + j, 1);` – Noreen Bibi Jun 11 '16 at 15:17
And this in the CUDA version: `int OffsThreadInRow = threadIdx.y * BLOCK_SIZE + threadIdx.x; int OffsThreadInCol = threadIdx.z * BLOCK_SIZE; src += FMUL(blockIdx.y * KER2_BLOCK_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * KER2_BLOCK_WIDTH + OffsThreadInRow; dst += FMUL(blockIdx.y * KER2_BLOCK_HEIGHT + OffsThreadInCol, ImgStride) + blockIdx.x * KER2_BLOCK_WIDTH + OffsThreadInRow; float *bl_ptr = block + OffsThreadInCol * KER2_SMEMBLOCK_STRIDE + OffsThreadInRow;` – Noreen Bibi Jun 11 '16 at 15:19
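Reading only the expressions quoted in the comment above, the CUDA lines appear to compute where each thread starts inside the image: `blockIdx` selects a tile of `KER2_BLOCK_WIDTH` x `KER2_BLOCK_HEIGHT` pixels, `threadIdx.y` and `threadIdx.z` select an 8x8 block inside that tile, and `threadIdx.x` selects a column inside that block. The sketch below reuses that offset arithmetic in a plain copy kernel to show that the threads together touch every pixel exactly once. The constant values (8, 32, 16), the launch shape, and the names `copy_with_quoted_addressing` and `count_copy_mismatches` are assumptions for illustration, and `FMUL` is written as an ordinary multiplication; check the sample's headers for the real definitions.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed values for illustration only.
#define BLOCK_SIZE        8
#define KER2_BLOCK_WIDTH  32
#define KER2_BLOCK_HEIGHT 16

// Each thread walks down one 8-pixel column of one 8x8 block.
__global__ void copy_with_quoted_addressing(const float *src, float *dst, int ImgStride)
{
    // Column of this thread inside the tile: which 8x8 block (threadIdx.y)
    // and which column of that block (threadIdx.x).
    const int OffsThreadInRow = threadIdx.y * BLOCK_SIZE + threadIdx.x;
    // First row of this thread inside the tile: which row of 8x8 blocks.
    const int OffsThreadInCol = threadIdx.z * BLOCK_SIZE;

    // Tile origin (from blockIdx) plus the thread's position inside the tile.
    const int offset = (blockIdx.y * KER2_BLOCK_HEIGHT + OffsThreadInCol) * ImgStride
                     + blockIdx.x * KER2_BLOCK_WIDTH + OffsThreadInRow;
    src += offset;
    dst += offset;

    // Stepping by ImgStride moves one row down within the same column.
    for (int k = 0; k < BLOCK_SIZE; ++k)
        dst[k * ImgStride] = src[k * ImgStride];
}

// Hypothetical self-check: copies a test image and counts differing pixels.
int count_copy_mismatches()
{
    const int WIDTH = 64, HEIGHT = 32;    // assumed multiples of the tile size
    const size_t n = (size_t)WIDTH * HEIGHT;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (size_t p = 0; p < n; ++p) h_in[p] = (float)p;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0xFF, n * sizeof(float));   // untouched pixels stay invalid

    dim3 threads(BLOCK_SIZE, KER2_BLOCK_WIDTH / BLOCK_SIZE, KER2_BLOCK_HEIGHT / BLOCK_SIZE);
    dim3 grid(WIDTH / KER2_BLOCK_WIDTH, HEIGHT / KER2_BLOCK_HEIGHT);
    copy_with_quoted_addressing<<<grid, threads>>>(d_in, d_out, WIDTH);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int mismatches = 0;
    for (size_t p = 0; p < n; ++p)
        if (h_out[p] != h_in[p]) ++mismatches;
    return mismatches;
}

int main()
{
    printf("mismatching pixels: %d\n", count_copy_mismatches());
    return 0;
}
```

The C line quoted in the first comment uses the same kind of pointer arithmetic: `fSrc + (i+k) * Stride + j` points at column `j` of row `i+k`, and the step argument `1` means consecutive elements along that row.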

