
I am basically looking for a way to synchronize a stream from within the device. I want to avoid using cudaDeviceSynchronize(), as it would serialize the execution of kernels that I want to run concurrently in different streams.

More detailed description: I have written a kernel that implements a stabilized bi-conjugate gradient (BiCGStab) solver. I want to launch this kernel concurrently on different data using streams.

This kernel uses cublas functions. They are called from within the kernel.

One of the operations required by the solver is the calculation of a dot product of two vectors. This could be done with cublasSdot(), but as this call is synchronous, execution of kernels in different streams gets serialized. Instead of calling a dot product function, I calculate the dot product using cublasSgemv(), which is called asynchronously (treating one vector as a 1 x n matrix, so the matrix-vector product reduces to a dot product). The problem is that this function returns before the result is calculated. I therefore want to synchronize the stream from the device - I am looking for an equivalent of cudaStreamSynchronize() that is callable from the device.

__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, float * x, float * y) {
      float *norm = new float;           // result must live in device heap memory
      float alpha = 1.0f; float beta = 0.0f;

      // treat x as a 1 x n matrix: *norm = alpha * x * y + beta * (*norm), i.e. a dot product
      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N, 1, n, &alpha, x, 1, y, 1, &beta, norm, 1);

      return *norm;                      // this read may occur before the asynchronous gemv finishes
}

What can I do to make sure that the result is calculated before the function returns? Inserting cudaDeviceSynchronize() works, of course, but as I mentioned, it serializes the execution of my kernel across streams.

alfC
  • You say that `cublasdot()` calls are synchronous. What do you mean? cuBLAS calls execute asynchronously. I think there is no alternative to the use of `cudaDeviceSynchronize()` to achieve active waiting from the device. – Vitality Dec 13 '13 at 21:30
  • Indeed, cuBLAS API is, except for few Level 1 routines which return a scalar value, asynchronous as you wrote. Thanks for your answer, but perhaps someone has any other idea? – user3100782 Dec 17 '13 at 16:13

1 Answer


If you read the dynamic parallelism section of the programming guide carefully (especially streams, events, and synchronization), you may get some ideas. Here's what I came up with:

There is an implicit NULL stream (on the device) associated with the execution sequence that calls your _cDdot function (oddly named, IMHO, since you're working with float quantities in that case, i.e. using Sgemv). Therefore, any CUDA kernel or API call issued after the call to cublasSgemv_v2 in your function should wait until all CUDA activity associated with the cublasSgemv_v2 function is complete. If you insert an innocuous CUDA API call, or else a dummy kernel call, after the call to cublasSgemv_v2, it should wait for that to be complete. This should give you the thread-level synchronization you are after. You might also be able to use a cudaEventRecord call followed by a cudaStreamWaitEvent call.
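As a rough, untested sketch of the event-based alternative just mentioned: the device runtime only supports events created with cudaEventCreateWithFlags and the cudaEventDisableTiming flag, and the function name _cDdot_ev here is illustrative, not from the question. Note that cudaStreamWaitEvent orders subsequently *launched* work in the stream; it does not block the calling device thread, so this may still not make the result safe to read in-place:

__device__ float _cDdot_ev(cublasHandle_t & cublasHandle, const int n,
                           float * x, float * y) {
      float *norm = new float;
      float alpha = 1.0f; float beta = 0.0f;
      *norm = 0.0f;

      // device-side events must disable timing; this is the only flag the
      // device runtime supports
      cudaEvent_t done;
      cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N, 1, n, &alpha, x, 1, y, 1, &beta, norm, 1);

      // record after the gemv in the implicit NULL stream (0), then make any
      // later work issued into that stream wait on the event
      cudaEventRecord(done, 0);
      cudaStreamWaitEvent(0, done, 0);
      cudaEventDestroy(done);

      return *norm;
}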

Here's an example to show the implicit stream synchronization approach:

#include <stdio.h>
#include <cublas_v2.h>
#define SZ 16

__global__ void dummy_kernel(float *in, float *out){
  *out = *in;
}

__device__ float _cDdot(cublasHandle_t & cublasHandle, const int n, float * x, float * y, const int wait) {
      float *norm = new float;
      float alpha = 1.0f; float beta = 0.0f;
      *norm = 0.0f;
      cublasSgemv_v2(cublasHandle, CUBLAS_OP_N, 1, n, &alpha, x, 1, y, 1, &beta, norm, 1);
      if (wait){
        // any launch into the implicit NULL stream must wait for the gemv to finish
        dummy_kernel<<<1,1>>>(norm, norm);
        }
      return *norm;
}


__global__ void compute(){
  cublasHandle_t my_h;
  cublasStatus_t status;
  status = cublasCreate(&my_h);
  if (status != CUBLAS_STATUS_SUCCESS) printf("cublasCreate fail\n");
  float *x, *y;
  x = new float[SZ];
  y = new float[SZ];
  for (int i = 0; i < SZ; i++){
    x[i] = 1.0f;
    y[i] = 1.0f;}
  float result = _cDdot(my_h, SZ, x, y, 0);
  printf("result with no wait = %f\n", result);
  result = _cDdot(my_h, SZ, x, y, 1);
  printf("result with wait = %f\n", result);
}

int main(){

  compute<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}

compile with:

nvcc -arch=sm_35 -rdc=true -o t302 t302.cu -lcudadevrt -lcublas -lcublas_device

results:

$ ./t302
result with no wait = 0.000000
result with wait = 16.000000
$

Unfortunately, when I tried a completely empty dummy_kernel, it did not work unless I compiled with -G. So the compiler may be smart enough to optimize out a completely empty child kernel call.

Robert Crovella
  • Thank you for your reply. Unfortunately, I am not sure I agree with your idea. The function _cDdot is used in a biconjugate gradient solver. For small problems, I wanted to launch the solver concurrently for several different inputs by assigning the kernel launches to different streams. For big problems, I have only one stream (say the default stream) - in that case I am better off using the cuBLAS function for calculation of the dot product (it doesn't matter that it is synchronous, because there is only one stream)... – user3100782 Jan 09 '14 at 19:00
  • ....For a concurrent launch (for small matrices), your approach will still result in serialized execution. I played with cudaEventRecord and cudaStreamWaitEvent, but when called from the device, I can't get the desired behaviour :/ Thank you very much for your time!! – user3100782 Jan 09 '14 at 19:01