
Sorry for my English. I have a CUDA kernel which returns different result values from time to time. The kernel computes the sum of a series and consists of 4 code parts. Let me explain a little how it works. The first part distributes the iterations between the threads (I took this from a source). The second part shows how every thread computes its partial sum. After the second part we must place __syncthreads(), because after it we start to use shared memory. In the third part I get the resulting sum of all threads in a block and put it into the thread whose threadIdx.x equals 0 (I took this from a source, page 22). In the fourth part I get the resulting sum of all thread blocks and put it into dSum[0].

Did I place __syncthreads() correctly? Where is the error? Why does it give a wrong result with 64 blocks and 768 threads, but the correct result with 768 blocks and 64 threads?

__global__ void sumSeries(double* dSum,int totalThreadNumber){  
    volatile  __shared__ double data[768];  
    int tid=threadIdx.x+blockIdx.x*blockDim.x;
    int myend;
    double var;
    //part_1 get tid's start iteration value and end iteration value.
    int mystart = (INT_MAX / totalThreadNumber) * tid;
    if (INT_MAX %  totalThreadNumber > tid)
    {
        mystart += tid;
        myend = mystart + (INT_MAX /  totalThreadNumber) + 1;
    }
    else
    {
        mystart += INT_MAX %  totalThreadNumber;
        myend = mystart + (INT_MAX /  totalThreadNumber);
    }
    //part_2 get halfsum
    data[threadIdx.x]=0;
    for (int i = mystart; i < myend; ++i){
        var = i;
        data[threadIdx.x] += (var*var+var+1)/(var*var*var+var*var+var+1);
    }
    __syncthreads();

    //part_3 sum all results in every block
    for (int s=blockDim.x/2; s>32; s>>=1)
    {
        if (threadIdx.x < s)
            data[threadIdx.x] += data[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x < 32)
    {
        data[threadIdx.x] += data[threadIdx.x + 32];
        data[threadIdx.x] += data[threadIdx.x + 16];
        data[threadIdx.x] += data[threadIdx.x + 8];
        data[threadIdx.x] += data[threadIdx.x + 4];
        data[threadIdx.x] += data[threadIdx.x + 2];
        data[threadIdx.x] += data[threadIdx.x + 1];
    }

    if (threadIdx.x==0)
    {
        dSum[blockIdx.x]=data[0];
    }
    __syncthreads();
    //part_4
    if (tid==0)
        for (int t=1;t<8;++t)
            dSum[0]=dSum[0]+dSum[t];
}
maxi
  • are you doing [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api)? What happens when you run your code with `cuda-memcheck` ? What sort of GPU are you running on, and what is your `nvcc` compile command line? – Robert Crovella Mar 04 '14 at 21:00
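
For reference, the kind of error checking the comment points to looks roughly like this (a minimal sketch; the checkCuda name and the launch parameter names are made up here, only cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString are real runtime API calls):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// minimal error-checking helper (name invented for this sketch)
#define checkCuda(call)                                           \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// usage around the kernel launch from the question
// (blocks/threads are placeholder names for the launch configuration):
//   sumSeries<<<blocks, threads>>>(dSum, blocks * threads);
//   checkCuda(cudaGetLastError());       // launch/configuration errors
//   checkCuda(cudaDeviceSynchronize());  // errors during kernel execution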

1 Answer


So your sum is the series over

(n^2+n+1)/(n^3+n^2+n+1) = (n^3-1)/(n^4-1)

This has the same convergence behaviour as the harmonic series over

1/n

namely none: it diverges, very very slowly, towards infinity. The sum from 1 to N has a value between log(N) and 1-log(2)+log(N+1).

The result of any finite summation of this series is very sensitive to the order of summation. Summing forward from 1 to N, with the terms decreasing, suppresses all terms where 1 == 1+1/n in floating point, which happens at a rather small n for floats. Summing backwards from some N down to 1 accumulates the small numbers first and preserves their cumulative contribution.

So depending on the order of arrival of the partial sums, especially when the sum containing 1 comes in, the total sum will show noticeable differences.
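
To see the effect in isolation, here is a small host-side sketch (not part of the kernel; the cutoff N is an arbitrary choice for the demo) that sums the same terms in single precision once from the front and once from the back:

#include <cstdio>

// term of the series: f(n) = (n^2+n+1)/(n^3+n^2+n+1)
static float term(float n) {
    return (n*n + n + 1.0f) / (n*n*n + n*n + n + 1.0f);
}

int main() {
    const long long N = 100000000LL;  // arbitrary cutoff for the demo

    float forward = 0.0f;             // large terms first; small terms get rounded away
    for (long long i = 0; i < N; ++i)
        forward += term((float)i);

    float backward = 0.0f;            // small terms first; they can still accumulate
    for (long long i = N - 1; i >= 0; --i)
        backward += term((float)i);

    printf("forward : %.7f\n", forward);
    printf("backward: %.7f\n", backward);
    return 0;
}

In the kernel each thread does the forward variant over its own index range, and the block and grid reductions then fix the order in which those partial sums meet.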


Both terms in the partial fraction decomposition

f(x) = (x^2+x+1)/(x^3+x^2+x+1) = 0.5/(x+1)+0.5*(x+1)/(x^2+1)

are monotonically decreasing for x >= 1.

The anti-derivative of that function is

F(x) = 0.5*ln(x+1)+0.25*ln(x^2+1)+0.5*arctan(x)

so that

f(n+1) <= F(n+1)-F(n) <= f(n) <= F(n)-F(n-1)

summing this up results in

F(N+1)-F(m) <= sum(n=m to N) f(n) <= F(N)-F(m-1)

To all three of these terms one then has to add the initial part of the sum, i.e. the terms with n < m.

So set m=1000, compute S = sum(n=0 to 999) f(n), then

S+F(2^32  )-F(1000) = 23.459829390459243
S+F(2^32-1)-F( 999) = 23.460829890558995

are the lower and upper bounds for a summation from 0 to 2^32-1, far away from any of the numerical results.
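
A quick way to reproduce these two numbers, using the same f and F as above (a host-side sketch, not part of the kernel):

#include <cstdio>
#include <cmath>

// term of the series
static double f(double x) {
    return (x*x + x + 1.0) / (x*x*x + x*x + x + 1.0);
}

// antiderivative F(x) = 0.5*ln(x+1) + 0.25*ln(x^2+1) + 0.5*arctan(x)
static double F(double x) {
    return 0.5*log(x + 1.0) + 0.25*log(x*x + 1.0) + 0.5*atan(x);
}

int main() {
    // initial part S = sum(n=0 to 999) f(n)
    double S = 0.0;
    for (int n = 0; n < 1000; ++n)
        S += f((double)n);

    const double N = 4294967296.0;  // 2^32

    double lower = S + F(N)       - F(1000.0);  // lower bound for sum(n=0..2^32-1)
    double upper = S + F(N - 1.0) - F(999.0);   // upper bound

    printf("lower: %.15f\n", lower);
    printf("upper: %.15f\n", upper);
    return 0;
}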

Lutz Lehmann
  • I understand. I think that the result must be the same, because on every CUDA launch thread 0 will count elements from 0 to 349525, the second will count from 349526 to ... On every launch they sum in the same order. I launched an MPI analogue on 192 threads; it returned the same result as the sequential analogue. – maxi Mar 04 '14 at 20:09
  • Can you explain why with 64 blocks and 768 threads it gives a wrong result, and with 768 blocks and 64 threads it gives the correct result? – maxi Mar 04 '14 at 20:49
  • As I understand it, the threads will run in parallel. Do you control in which order the results of the threads arrive? Do you sum up the results as they arrive or are they stored until the last arrives and then summed up? Preferably from last to first. – Lutz Lehmann Mar 04 '14 at 21:14
  • It's in the code, yes to all, you do. The summation is also drifting from the back to the front, so there should be no avoidable floating point cancellations. I do not understand parallel computing well enough to see what could go wrong. – Lutz Lehmann Mar 04 '14 at 21:23
  • Lutzl, I think that you are correct. When I increase the number of thread blocks from 1 or 2 (sum=21.7276) to 1024 (sum=16.1234) and 8128 (sum=14.8549), I see that the series sum becomes lower and lower. Did you mean that? – maxi Mar 04 '14 at 22:00
  • What could happen is that due to the binary tree structure many of the small thread sums move too fast to the front to get summed up with the large thread sums, before they could accumulate with thread sums of the same size. This would not be so important for larger block sizes with fewer threads, but could lead to too many improper cancellations for a bigger number of threads. – Lutz Lehmann Mar 04 '14 at 22:11