I have tried to implement the Haar wavelet transform in CUDA for a 1D array.

ALGORITHM

I have 8 elements in the input array.

With the condition if(x_index>=o_width/2 || y_index>=o_height/2) I should have 4 active threads, which should start at input indices 0, 2, 4 and 6, and I plan to handle two input elements with each of them.

I calculate the average. E.g., if my thread starts at index 0, then avg is (input[0]+input[1])/2, and at the same time I get the difference, which is input[0]-avg, and so on for the rest of the threads.

Now the important thing is the placement of the output. I created a separate thread id for the output, because using indices 0, 2, 4, 6 made it difficult to place the output at the correct index.

My averages should be placed in the first 4 indices of the output, i.e. 0, 1, 2, 3, so o_thread_id should be 0, 1, 2, 3. Similarly, to place the differences at 4, 5, 6, 7, I have offset 0, 1, 2, 3 by 4, as shown in the code.
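For comparison, the placement described above can be sketched as a plain CPU loop. This is only a reference sketch, not the CUDA kernel: the name haar_reference is hypothetical, the input length is assumed to be even, and the average is computed in floating point (dividing by 2.0f), which is an assumption about the intended result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for one level of the 1D Haar step described above.
// haar_reference is a hypothetical helper name, not part of the original code.
std::vector<float> haar_reference(const std::vector<int>& input)
{
    assert(input.size() % 2 == 0);               // sketch assumes an even length
    const std::size_t half = input.size() / 2;   // number of input pairs
    std::vector<float> output(input.size(), 0.0f);

    for (std::size_t pair = 0; pair < half; ++pair) {
        // Divide by 2.0f so the average is not truncated by integer division.
        const float avg  = (input[2 * pair] + input[2 * pair + 1]) / 2.0f;
        const float diff = input[2 * pair] - avg;

        output[pair]        = avg;   // averages fill indices 0 .. half-1
        output[pair + half] = diff;  // differences fill indices half .. n-1
    }
    return output;
}
```

For the input {1,2,3,4,5,6,7,8} this gives the averages 1.5, 3.5, 5.5, 7.5 in the first four slots and -0.5 in each of the last four, which is a handy check against whatever the kernel returns.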

PROBLEM

My output comes out as all zeros! No matter what I change, I get the same result.

CODE

__global__ void cal_haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int x_index=blockIdx.x*blockDim.x+threadIdx.x;
    int y_index=blockIdx.y*blockDim.y+threadIdx.y;

    if(x_index>=o_width/2 || y_index>=o_height/2) return;

    int i_thread_id=y_index*i_widthstep+(2*x_index);
    int o_thread_id=y_index*o_widthstep+x_index;

    float avg=(input[i_thread_id]+input[i_thread_id+1])/2;
    float diff=input[i_thread_id]-avg;
    output[o_thread_id]=avg;
    output[o_thread_id+4]=diff;

}

void haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int * d_input;
    float * d_output;

    cudaMalloc(&d_input,i_widthstep*o_height);
    cudaMalloc(&d_output,o_widthstep*o_height);

    cudaMemcpy(d_input,input,i_widthstep*o_height,cudaMemcpyHostToDevice);

    dim3 blocksize(16,16);
    dim3 gridsize;
    gridsize.x=(o_width+blocksize.x-1)/blocksize.x;
    gridsize.y=(o_height+blocksize.y-1)/blocksize.y;

    cal_haar<<<gridsize,blocksize>>>(d_input,d_output,i_widthstep,o_widthstep,o_width,o_height);


    cudaMemcpy(output,d_output,o_widthstep*o_height,cudaMemcpyDeviceToHost);

    cudaFree(d_input);
    cudaFree(d_output);

}

The following is my main function:

void main()
{
    int in_arr[8]={1,2,3,4,5,6,7,8};
    float out_arr[8];
    int i_widthstep=8*sizeof(int);
    int o_widthstep=8*sizeof(float);
    haar(in_arr,out_arr,i_widthstep,o_widthstep,8,1);

    for(int c=0;c<=7;c++)
    {cout<<out_arr[c]<<endl;}
    cvWaitKey();

}

Can you tell me where I am going wrong that it gives me zeros as output? Thank you.

talonmies
Code_Jamer
  • Did you try to use * instead of []? – geek May 23 '12 at 19:13
  • Sorry, I don't understand. Could you point to the specific line of code? – Code_Jamer May 23 '12 at 19:20
  • For example, instead of the signature `__global__ void cal_haar(int input[],float output [],int i_widthstep,float o_widthstep,int o_width,int o_height)`, try using pointers, like this: `__global__ void cal_haar(int* input,float* output,int i_widthstep,float o_widthstep,int o_width,int o_height)`. – geek May 23 '12 at 19:23
  • Yes marina, I did that, but I am still getting the same problem of all zeros in the output array. – Code_Jamer May 23 '12 at 19:28
  • Why is `o_widthstep` a float? And what are `i_widthstep`, `o_width` and `o_height`? If they are dimensions of the inputs in elements, why are the `cudaMalloc` and `cudaMemcpy` calls not including the size of the types? There is a lot wrong here. All zero output probably means the kernel never runs because of other errors. Why does your code include no error checking? – talonmies May 23 '12 at 19:43
  • Yes, you are right, o_widthstep need not be float. It should be int. But changing that makes no difference either. I have taken the size of the types into account in my main function, listed below. o_width and o_height are the output width and output height. – Code_Jamer May 23 '12 at 19:51
  • Include additional relevant information in your question. It's very difficult to read and follow in a comment. – Bart May 23 '12 at 20:06
  • Right. I have included the additional information in the question. – Code_Jamer May 23 '12 at 20:18
  • @asd: please take more care formatting code when you post it. This is the second time I have had to edit this question to fix badly or wrongly formatted code. – talonmies May 23 '12 at 20:24
  • OK. My apologies for the inconvenience. – Code_Jamer May 23 '12 at 20:27
  • OK, so widthstep is in bytes, which takes care of malloc and memcpy. But in the kernel you use it to calculate an index into your float array, which means you'll be accessing memory which has not been allocated. And follow @talonmies' advice and add some error checks. Also see the cuda-memcheck tool to help you find mistakes like this. – Peter May 23 '12 at 20:33

1 Answer

The problem with your code is the following condition:

if(x_index>=o_width/2 || y_index>=o_height/2) return;

Given o_height = 1, we have o_height/2 = 0 (o_height is an int, so this is integer division, which rounds down), so every thread returns early and none performs any work. To achieve what you want, you can either do floating-point arithmetic here, or use (o_height+1)/2 and (o_width+1)/2, which rounds the result up instead of down. The condition then becomes ( x_index >= (8+1)/2 /*= 4*/ || y_index >= (1+1)/2 /*= 1*/ ).
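To make the effect of the rounding concrete, here is a small host-side sketch that replays the guard for every thread of a launch and counts how many would do work. The helper names half_ceil, thread_is_active and count_active are hypothetical, and the 16x16 launch matches the block size in the question only as an illustration.

```cpp
#include <cassert>

// Rounds the half-size up, so a dimension of 1 still admits index 0.
// half_ceil, thread_is_active and count_active are hypothetical helper names.
int half_ceil(int n) { return (n + 1) / 2; }

// Mirrors the kernel guard: true when the thread at (x, y) should do work.
bool thread_is_active(int x, int y, int width, int height, bool round_up)
{
    const int x_limit = round_up ? half_ceil(width)  : width / 2;
    const int y_limit = round_up ? half_ceil(height) : height / 2;
    return !(x >= x_limit || y >= y_limit);
}

// Counts active threads in a launch of threads_x by threads_y threads.
int count_active(int threads_x, int threads_y, int width, int height, bool round_up)
{
    assert(width >= 0 && height >= 0);
    int active = 0;
    for (int y = 0; y < threads_y; ++y)
        for (int x = 0; x < threads_x; ++x)
            if (thread_is_active(x, y, width, height, round_up)) ++active;
    return active;
}
```

With width 8 and height 1, count_active(16, 16, 8, 1, false) is 0 and count_active(16, 16, 8, 1, true) is 4. Note that with an odd width the rounded-up limit would let the last thread read one element past the end of the row, so this only covers the even-width case from the question.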

Besides, there is a problem with addressing when you have more than one thread in the Y dimension, since your i_thread_id and o_thread_id calculations would then be wrong (_widthstep is a size in bytes, yet you use it as an array index).
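One possible way to handle this, sketched on the host side, is to convert the byte pitch into a stride in elements before indexing. The helper names elements_per_row and element_index are hypothetical, and the sketch assumes the pitch is a whole number of elements.

```cpp
#include <cassert>
#include <cstddef>

// Converts a row pitch in bytes into a row stride in elements of type T.
// elements_per_row and element_index are hypothetical helper names.
template <typename T>
std::size_t elements_per_row(std::size_t widthstep_bytes)
{
    assert(widthstep_bytes % sizeof(T) == 0);  // assumes the pitch is a whole number of elements
    return widthstep_bytes / sizeof(T);
}

// Element index of column x in row y, given a row pitch in bytes.
template <typename T>
std::size_t element_index(std::size_t x, std::size_t y, std::size_t widthstep_bytes)
{
    return y * elements_per_row<T>(widthstep_bytes) + x;
}
```

In the 1D case from the question y_index is always 0, so the byte pitch is multiplied by zero and the mistake stays hidden. With a second row, element_index<int>(2, 1, 8*sizeof(int)) gives 10, whereas y_index*i_widthstep + x would give 34 on a platform with 4-byte int, well outside an 8x2 array.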

aland
  • I am using i_widthstep to calculate the thread id within a block from the global x_index and y_index, and for that I need widthstep, which is definitely in bytes. My output is also correct for the 1D array, so I assume that's the way to do it. But yes, there was the problem with the integer division you mentioned. – Code_Jamer May 23 '12 at 20:57