CUDA kernels not launching before CudaDeviceSynchronize

Question

cuda profiler output:

I am having some trouble with concurrent CUDA. Take a look at the attached image. The kernel is launched at the marked point, at 0.395 seconds. Then there is some green CpuWork. Finally, there is a call to cudaDeviceSynchronize. The kernels that is launched before CpuWork doesnt start before the synchronize call. Ideally, it should run in parallel with the CPU work.

void KdTreeGpu::traceRaysOnGpuAsync(int firstRayIndex, int numRays, int rank, int buffer)
{
    int per_block = 128;
    int num_blocks = numRays/per_block + (numRays%per_block==0?0:1);

    Ray* rays = &this->deviceRayPtr[firstRayIndex];
    int* outputHitPanelIds = &this->deviceHitPanelIdPtr[firstRayIndex];

    kdTreeTraversal<<<num_blocks, per_block, 0>>>(sceneBoundingBox, rays, deviceNodesPtr, deviceTrianglesListPtr, 
                                                firstRayIndex, numRays, rank, rootNodeIndex, 
                                                deviceTHitPtr, outputHitPanelIds, deviceReflectionPtr);

    CUDA_VALIDATE(cudaMemcpyAsync(resultHitDistances[buffer], deviceTHitPtr, numRays*sizeof(double), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultHitPanelIds[buffer], outputHitPanelIds, numRays*sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_VALIDATE(cudaMemcpyAsync(resultReflections[buffer], deviceReflectionPtr, numRays*sizeof(Vector3), cudaMemcpyDeviceToHost));
}

The memcopies are async. The result buffers are allocated like this

unsigned int flag = cudaHostAllocPortable;

CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[0], MAX_RAYS_PER_ITERATION*sizeof(int), flag));
CUDA_VALIDATE(cudaHostAlloc(&resultHitPanelIds[1], MAX_RAYS_PER_ITERATION*sizeof(int), flag));

Hoping for a solution for this. Have tried many things, including not running in the default stream. When i added cudaHostAlloc i recognized that the async method returned back to the CPU. But that doesnt help when the kernel does not launch before the deviceSynchronize call later.

resultHitDistances[2] contains two allocated memory areas so that when 0 is read by the CPU, the GPU should put the result in 1.

Thanks!

Edit: This is the code that calls traceRaysAsync.

int numIterations = ceil(float(this->numPrimaryRays) / MAX_RAYS_PER_ITERATION);
int numRaysPrevious = min(MAX_RAYS_PER_ITERATION, this->numPrimaryRays);
nvtxRangePushA("traceRaysOnGpuAsync First");
traceRaysOnGpuAsync(0, numRaysPrevious, rank, 0);
nvtxRangePop();

for(int iteration = 0; iteration < numIterations; iteration++)
{

    int rayFrom = (iteration+1)*MAX_RAYS_PER_ITERATION;
    int rayTo = min((iteration+2)*MAX_RAYS_PER_ITERATION, this->numPrimaryRays) - 1;
    int numRaysIteration = rayTo-rayFrom+1;

    // Wait for results to finish and get them

    waitForGpu();
    // Trace the next iteration asynchronously. This will have data prepared for next iteration

    if(numRaysIteration > 0)
    {
        int nextBuffer = (iteration+1) % 2;
        nvtxRangePushA("traceRaysOnGpuAsync Interior");
        traceRaysOnGpuAsync(rayFrom, numRaysIteration, rank, nextBuffer);
        nvtxRangePop();
    }
    nvtxRangePushA("CpuWork");

    // Store results for current iteration

    int rayOffset = iteration*MAX_RAYS_PER_ITERATION;
    int buffer = iteration % 2;

    for(int i = 0; i < numRaysPrevious; i++)
    {
        if(this->activeRays[rayOffset+i] && resultHitPanelIds[buffer][i] >= 0)
        {
            this->activeRays[rayOffset+i] = false;
            const TrianglePanelPair & t = this->getTriangle(resultHitPanelIds[buffer][i]);
            double hitT = resultHitDistances[buffer][i];

            Vector3 reflectedDirection = resultReflections[buffer][i];

            Result res = Result(rays[rayOffset+i], hitT, t.panel);
            results[rank].push_back(res);
            t.panel->incrementIntensity(1.0);

            if (t.panel->getParent().absorbtion < 1)
            {
                numberOfRaysGenerated++;

                Ray reflected (res.endPoint() + 0.00001*reflectedDirection, reflectedDirection);

                this->newRays[rayOffset+i] = reflected;
                this->activeRays[rayOffset+i] = true;
                numNewRays++;

            }
        }



    }

    numRaysPrevious = numRaysIteration;

    nvtxRangePop();

}

You're not showing the code right after your KdTreeGpu::traceRaysOnGpuAsync call, but that would probably be useful, for example to see where and why you are using a cudaDeviceSynchronize() call? I think you're issuing the devicesync right after your call to KdTreeGpu::traceRaysOnGpuAsync, but that will kill your overlap. This is the area where you want the overlap, and assuming the 2nd green CpuWork bar doesn't depend on the results from kdTreeTraversal, then you want to move or eliminate that devicesync right afer your kernel function call. re-factor some CpuWork *before* the devicesync. — Robert Crovella, Nov 26 '12 at 16:48
I added some more code which has been cleaned of some timers so it should be easier to follow. The two buffers should make CpuWork independent of the kernel launch. — apartridge, Nov 26 '12 at 16:56

score 6 · Accepted Answer · answered Nov 26 '12 at 17:32

6

This is the expected behavior on Windows with the WDDM driver model, where the driver tries to mitigate the kernel launch overhead by trying to batch kernel launches. Try inserting cudaStreamQuery(0) straight after the kernel invocation to trigger early launching of the kernel before the batch is full.

answered Nov 26 '12 at 17:32

tera

7,080
1
21
32

1

To avoid performance issues with the WDDM driver model, consider switching to the TCC driver. – njuffa Nov 26 '12 at 21:11
One straight after the kernel and one after the two memcopy's did the trick. With just one after the kernel, memcpy was delayed until the syncrhonize. Now, it is in proper parallel. Thanks! – apartridge Nov 27 '12 at 10:49

CUDA kernels not launching before CudaDeviceSynchronize

1 Answers1

Linked