Odd-Even Sort using cuda Programming

Question

I'm trying to implement odd-even sort program in cuda-c language. But, whenever I give a 0 as one of the elements in the input array, the resulted array is not properly sorted.In other cases, however, it is working for other input.I don't understand what is the problem with the code.Here is my code:

#include<stdio.h>
#include<cuda.h>
#define N 5

__global__ void sort(int *c,int *count)
{
    int l;
    if(*count%2==0)
          l=*count/2;
    else
         l=(*count/2)+1;
    for(int i=0;i<l;i++)
    {
            if(threadIdx.x%2==0)  //even phase
            {
                if(c[threadIdx.x]>c[threadIdx.x+1])
                {
                    int temp=c[threadIdx.x];
                    c[threadIdx.x]=c[threadIdx.x+1];
                    c[threadIdx.x+1]=temp;
                }

            __syncthreads();
            }
            else     //odd phase
            {
                if(c[threadIdx.x]>c[threadIdx.x+1])
                {
                    int temp=c[threadIdx.x];
                    c[threadIdx.x]=c[threadIdx.x+1];
                    c[threadIdx.x+1]=temp;
                }

            __syncthreads();
            }
    }//for

}



int main()
{int a[N],b[N],n;
    printf("enter size of array");
    scanf("%d",&n);
    print("enter the elements of array");
  for(int i=0;i<n;i++)
  {
    scanf("%d",&a[i]);
  }
  printf("ORIGINAL ARRAY : \n");
  for(int i=0;i<n;i++)
          {

          printf("%d ",a[i]);
          }
  int *c,*count;
  cudaMalloc((void**)&c,sizeof(int)*N);
  cudaMalloc((void**)&count,sizeof(int));
  cudaMemcpy(c,&a,sizeof(int)*N,cudaMemcpyHostToDevice);
  cudaMemcpy(count,&n,sizeof(int),cudaMemcpyHostToDevice);
  sort<<< 1,n >>>(c,count);
  cudaMemcpy(&b,c,sizeof(int)*N,cudaMemcpyDeviceToHost);
  printf("\nSORTED ARRAY : \n");
  for(int i=1;i<=n;i++)
      {
         printf("%d ",b[i]);
      }

}

What kind of sorting algorithm is this supposed to be? `O(n)` runtime would be impressive if it worked, especially for a comparison sort. What a surprise it doesn't. — EOF, Apr 11 '15 at 18:15
@EOF I make no claims about any of this except there appears to be some sort of conceptual similarity with what is presented [here](http://en.wikipedia.org/wiki/Odd%E2%80%93even_sort). — Robert Crovella, Apr 11 '15 at 18:55
@RobertCrovella: I understand you're not defending this question, but the important conceptual thing about comparison sorts is that they are at best`O(nlogn)`, which necessitates the two nested loops in your link. This question has no nested loop. — EOF, Apr 11 '15 at 19:28
@EOF The second/nested loop is missing because this is running on an inherently parallel, multithreaded architecture (CUDA). (And subject to the limitations I indicated in my answer related to threadblock.) And I don't think anyone mentioned `O(n)` except you. The link I provided suggests a worst case `O(n^2)` for naive even-odd sort. I didn't see anyone claim otherwise. — Robert Crovella, Apr 11 '15 at 19:35

score 4 · Accepted Answer · edited May 23 '17 at 12:22

Your kernel code had two main errors that I could see:

On the odd phase (for even length array, or even phase for odd length array), your last thread will index out of bounds at c[threadIdx.x+1]. For example, for 4 threads, they are numbered 0,1,2,3. Thread 3 is odd, but if you access c[3+1], that is not a defined element in your array. We can fix this by restricting each phase to work on all threads but the last one.
You were using __syncthreads() inside a conditional statement that would not allow all threads to reach the barrier. This is a coding error. Read the documentation. We can fix this by adjusting what code is inside the conditional regions.

In the main code, your final printout statements were indexing incorrectly:

for(int i=1;i<=n;i++)

that should be:

for(int i=0;i<n;i++)

You also have typo here:

print("enter the elements of array");

I assume that should be printf.

The following code has the above errors fixed, and seems to run correctly for me for arrays up to length 5 (your hardcoded limit on N). Even if you increased N, I'm not sure this would work beyond the size of a warp and certainly would not work beyond the threadblock size, but hopefully you are aware of that already(if not, read the doc link about __syncthreads()).

"Fixed" code:

#include<stdio.h>
#include<cuda.h>
#define N 5

#define intswap(A,B) {int temp=A;A=B;B=temp;}

__global__ void sort(int *c,int *count)
{
    int l;
    if(*count%2==0)
          l=*count/2;
    else
         l=(*count/2)+1;
    for(int i=0;i<l;i++)
    {
            if((!(threadIdx.x&1)) && (threadIdx.x<(*count-1)))  //even phase
            {
                if(c[threadIdx.x]>c[threadIdx.x+1])
                  intswap(c[threadIdx.x], c[threadIdx.x+1]);
            }

            __syncthreads();
            if((threadIdx.x&1) && (threadIdx.x<(*count-1)))     //odd phase
            {
                if(c[threadIdx.x]>c[threadIdx.x+1])
                  intswap(c[threadIdx.x], c[threadIdx.x+1]);
            }
            __syncthreads();
    }//for

}



int main()
{int a[N],b[N],n;
    printf("enter size of array");
    scanf("%d",&n);
    if (n > N) {printf("too large!\n"); return 1;}
    printf("enter the elements of array");
  for(int i=0;i<n;i++)
  {
    scanf("%d",&a[i]);
  }
  printf("ORIGINAL ARRAY : \n");
  for(int i=0;i<n;i++)
          {

          printf("%d ",a[i]);
          }
  int *c,*count;
  cudaMalloc((void**)&c,sizeof(int)*N);
  cudaMalloc((void**)&count,sizeof(int));
  cudaMemcpy(c,&a,sizeof(int)*N,cudaMemcpyHostToDevice);
  cudaMemcpy(count,&n,sizeof(int),cudaMemcpyHostToDevice);
  sort<<< 1,n >>>(c,count);
  cudaMemcpy(&b,c,sizeof(int)*N,cudaMemcpyDeviceToHost);
  printf("\nSORTED ARRAY : \n");
  for(int i=0;i<n;i++)
      {
         printf("%d ",b[i]);
      }

  printf("\n");
}

The usual recital about proper cuda error checking belongs here.

It's more-or-less equivalent to the OP's usage of `(threadIdx.x%2==0)` Either one is focused on the least significant bit of `threadIdx.x` to determine if it is an "odd" thread or an "even" thread. In theory, a bitwise AND (`&`) operation should be less expensive than a modulo (`%`) operation, but the compiler probably has an idiom for modulo-by-power-of-2, so in the final analysis there is probably no difference. It's a bit shorter way to express the same idea however. — Robert Crovella, Apr 12 '15 at 14:21

Odd-Even Sort using cuda Programming

1 Answers1