
I have N iterations and L memory locations. Each iteration writes to one of the L memory locations (indices 0 to L-1).

I want to store data in global memory based on the iteration number. Suppose iterations K and K-1 both write to the same location; the final value at that location should then come from iteration K. Thread K-1 must not overwrite the result of thread K.

I have tried the solution below. I keep an array holding, for each memory location, the highest iteration number that has written to it, initialized to -1. A thread checks whether its iteration number is greater than the stored value; if so, it stores its result and updates the maximum.

This is my current code, but it does not give correct results for larger iteration counts.

#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"

__global__ void fun(int *A, int *maxIndex, int *index1, int *lock)
{
    int threadid = blockIdx.x*blockDim.x + threadIdx.x;
    int iteration_no = threadid;
    int index = index1[threadid];
    int exitFromLoop = 1;
    while (exitFromLoop == 1)
    {
        // Try to acquire the per-location lock.
        int flag = atomicCAS(&lock[index], 0, 1);
        if (flag == 0)
        {
            if (maxIndex[index] < iteration_no)
            {
                A[index] = 89;
                maxIndex[index] = iteration_no;
                __threadfence();
            }
            //__threadfence();
            lock[index] = 0;   // release the lock
            exitFromLoop = 0;
        }
    }
}

int main()
{
    int A[10] = {10,20,30,40,50,60,70,80,90,100};
    int maxIndex[10] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
    int lock[10] = {0,0,0,0,0,0,0,0,0,0};
    int index[8192];
    srand(0);
    for (int ii = 0; ii < 8192; ii++)
    {
        index[ii] = rand() % 10;
    }
    int *index1;
    int *A1, *maxIndex1;
    int *lock1;
    cudaMalloc((void**)&lock1, sizeof(int)*10);
    cudaMalloc((void**)&A1, sizeof(int)*10);
    cudaMalloc((void**)&index1, sizeof(int)*8192);
    cudaMalloc((void**)&maxIndex1, sizeof(int)*10);
    cudaMemcpy(A1, A, sizeof(int)*10, cudaMemcpyHostToDevice);
    cudaMemcpy(lock1, lock, sizeof(int)*10, cudaMemcpyHostToDevice);
    cudaMemcpy(maxIndex1, maxIndex, sizeof(int)*10, cudaMemcpyHostToDevice);
    cudaMemcpy(index1, index, sizeof(int)*8192, cudaMemcpyHostToDevice);
    fun<<<16,512>>>(A1, maxIndex1, index1, lock1);
    cudaMemcpy(A, A1, sizeof(int)*10, cudaMemcpyDeviceToHost);
    cudaMemcpy(maxIndex, maxIndex1, sizeof(int)*10, cudaMemcpyDeviceToHost);
    printf("\nindex\n");
    for (int i = 0; i < 8192; i++)
    {
        printf("%d\n", index[i]);
    }
    for (int i = 0; i < 10; i++)
    {
        printf("%d max is %d\n", A[i], maxIndex[i]);
    }
}
  • Can you revise the question text? It's confusing and unclear. Here if people understand you, they will certainly help you. – Farzad Feb 12 '14 at 18:21
  • Actually, I have n iterations and I want to commit data into one array in global memory. The commit rule is: if iterations n and n-1 both write to the i-th index, the final value should be the one from iteration n; the result written by iteration n-1 should be overridden. Think of a loop: the final result after the loop finishes is what I want in the output. – user3279286 Feb 12 '14 at 18:32
  • Please, can someone reply? I need help. – user3279286 Feb 12 '14 at 19:33
  • @user3279286, does the question still correctly describe what you're trying to do? – Roger Dahl Feb 13 '14 at 01:37
  • @user3279286, where does iteration_no come from? – Roger Dahl Feb 13 '14 at 02:12
  • @RogerDahl I have modified the post and put the complete code in. The problem is that for the i-th index, the highest iteration number n should be the one whose write survives, but some earlier iteration's write survives instead, which is wrong. For example, if iterations 56 and 89 both write to index 2, the final value at index 2 should come from iteration 89, not 56; 89 should override 56. I think it is now clearer what I want. – user3279286 Feb 13 '14 at 12:03
  • @user3279286, Now that you've put this much code in, could you put it all together to a complete, compilable example? – Roger Dahl Feb 13 '14 at 14:07
  • @RogerDahl I don't understand; I did put the complete code in. What do you want? – user3279286 Feb 13 '14 at 14:25
  • @user3279286, I'm sorry, due to an error in the formatting, I thought that you had provided two separate code sections. I will try to take a look at this today. – Roger Dahl Feb 13 '14 at 14:42
  • @RogerDahl, please reply as early as possible; I need this urgently. Please help me soon. – user3279286 Feb 14 '14 at 05:52

1 Answer

I think this might be what you're after.

For each element in the A array, there is a corresponding element in the maxIndex array that holds the iteration_no of the highest-numbered thread that has updated that element of A so far. If the current thread has a higher iteration_no than that, maxIndex is updated to the current thread's iteration_no and the thread updates the element in A.

If the current thread has a lower iteration_no, neither the A element nor the iteration_no in maxIndex gets updated.

#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"

__global__ void fun(int *A, int *maxIndex, int *index)
{
  int iteration_no = blockIdx.x*blockDim.x + threadIdx.x;
  int i = index[iteration_no];
  // atomicMax() returns the old value, so the condition holds only for
  // threads that raised the maximum for this slot. (If threads wrote
  // distinct values instead of the constant 89, the write to A[i] could
  // still race; see the comments below.)
  if (atomicMax(maxIndex + i, iteration_no) < iteration_no) {
    A[i] = 89;
  }
}

int main()
{
  int A[10] = {10,20,30,40,50,60,70,80,90,100};
  int maxIndex[10] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};  // note: {-1} would set only the first element to -1
  int index[8192];
  srand(0);
  for(int ii=0;ii<8192;ii++)
  {
    index[ii]=rand()%10;
  }
  int *index_d;
  int *A_d,*maxIndex_d;
  cudaMalloc((void**)&A_d,sizeof(int)*10);
  cudaMalloc((void**)&index_d,sizeof(int)*8192);
  cudaMalloc((void**)&maxIndex_d,sizeof(int)*10);
  cudaMemcpy(A_d,&A,sizeof(int)*10,cudaMemcpyHostToDevice);
  cudaMemcpy(maxIndex_d,&maxIndex,sizeof(int)*10,cudaMemcpyHostToDevice);
  cudaMemcpy(index_d,&index,sizeof(int)*8192,cudaMemcpyHostToDevice);

  fun<<<16,512>>>(A_d,maxIndex_d,index_d);

  cudaMemcpy(&A,A_d,sizeof(int)*10,cudaMemcpyDeviceToHost);
  cudaMemcpy(&maxIndex,maxIndex_d,sizeof(int)*10,cudaMemcpyDeviceToHost);
  printf("\nindex \n");

  for(int i=0;i<8192;i++) {
    printf("%d\n",index[i]);
  }

  for(int i=0;i<10;i++) {
    printf(" %d max is %d\n",A[i],maxIndex[i]);
  }
}
Roger Dahl
  • Can you please explain it? – user3279286 Feb 14 '14 at 07:00
  • This is exactly what I want, sir. – user3279286 Feb 14 '14 at 07:07
  • Does this code now work correctly for any number of threads and blocks and any array size, or will you revise it? I tried it for some cases and it gives correct results. – user3279286 Feb 14 '14 at 07:08
  • It's done. I did a small revision, where I changed the starting point in maxIndex from 0 to -1, to make sure that thread 0 would be able to make an update as well. – Roger Dahl Feb 14 '14 at 07:12
  • Thank you so much, sir. I tried to figure out what the problem with my code is. Suppose 999 is stored in maxIndex and two threads, say 1000 and 1089, read this value at the same time. If thread 1000 writes later, the final result comes from thread 1000; if thread 1089 writes later, the final result comes from that thread. But we are using a lock, so why are two threads reading the value of the same maxIndex[k] at the same time? – user3279286 Feb 14 '14 at 07:18
  • @RogerDahl I have stored 89 at each location in memory, but if we instead take an array of size equal to the number of iterations and store that array's value per iteration, i.e. array[iteration_no], at A[index], it gives the wrong answer. – user3279286 Feb 14 '14 at 11:02
  • You're right, there's a race condition in the code. The problem is that it does not prevent two concurrently running threads from updating the same value. A critical section is needed. Try updating the code based on this: http://stackoverflow.com/a/18968893/442006 – Roger Dahl Feb 14 '14 at 14:12
  • I think an overall better solution to this might be some type of a scan or stream compaction done in a second kernel. – Roger Dahl Feb 14 '14 at 15:06
  • @RogerDahl Actually, the problem with stackoverflow.com/a/18968893/442006 is that when 2 threads in the same warp try to acquire the lock, it goes into a deadlock state. I have already tried this. Can't you suggest some other way? In my previously posted code, what should replace maxIndex[index]? – user3279286 Feb 16 '14 at 05:22
  • I see two possible solutions. You can modify the example in the CUDA C Programming Guide that uses `atomicCAS()` to implement a double precision `atomicAdd()` or you use two steps. The first is a kernel that just writes the results from each thread to a separate location (no overlap or concurrency issues) and the second is key based scan or reduce, where the key is the slot. There's probably a function in [CUB](http://nvlabs.github.io/cub/index.html) that can be used for this. I will write the function unless you write it first. But it will be a couple of days before I have time. – Roger Dahl Feb 16 '14 at 20:17
  • @user3279286 you said "problem with this link stackoverflow.com/a/18968893/442006 is that here when 2 threads in the same warp try to access then it does into deadlock state". That mechanism is *not designed* to arbitrate between threads in a warp. That is why the critical section is bracketed by `__syncthreads()` and checking if `threadIdx.x` is zero. Thread zero in each block arbitrates for access to the critical section. Multiple threads per block access can then be safely negotiated using ordinary block synchronization methods (i.e. shared memory and `__syncthreads()`). *Read* the answer – Robert Crovella Feb 16 '14 at 21:18
  • @RobertCrovella, thanks for spotting this. I also assumed thread based critical sections when I found the question. So a third way to implement this is block based critical sections and block synchronization. – Roger Dahl Feb 16 '14 at 22:13