cublasSetVector() vs cudaMemcpy()

Question

I am wondering if there is a difference between:

// cumalloc.c - Create a vector on the device
HOST float * cudamath_vector(const float * h_vector, const int m)
{
  float *d_vector = NULL;
  cudaError_t cudaStatus;
  cublasStatus_t cublasStatus;

  cudaStatus = cudaMalloc(&d_vector, sizeof(float) * m );

  /* Treat any allocation failure as an error, not just cudaErrorMemoryAllocation. */
  if (cudaStatus != cudaSuccess) {
    printf("ERROR: cumalloc.cu, cudamath_vector() : %s\n", cudaGetErrorString(cudaStatus));
    return NULL;
  }


  /*    THIS: */ cublasSetVector(m, sizeof(*d_vector), h_vector, 1, d_vector, 1);

  /* OR THAT: */ cudaMemcpy(d_vector, h_vector, sizeof(float) * m, cudaMemcpyHostToDevice);


  return d_vector;
}

cublasSetVector() has two arguments incx and incy and the documentation says:

The storage spacing between consecutive elements is given by incx for the source vector x and for the destination vector y.

In the NVIDIA forum someone said:

iona_me: "incx and incy are strides measured in floats."

So does this mean that for incx = incy = 1 all elements of a float[] will be sizeof(float)-aligned and for incx = incy = 2 there would be a sizeof(float)-padding between each element?

Apart from those two parameters and the cublasHandle, does cublasSetVector() do anything that cudaMemcpy() doesn't?
Would it be safe to pass a vector/matrix that was not created with the respective cublas*() function to other CUBLAS functions that manipulate it?

As far as I know, `cublasSetVector()` internally calls `cudaMemcpy`, or its 2D version for strided copies. So I think there is no problem even if the arrays to be set were created with `cudaMalloc`. In fact, I have been mixing cuBLAS and non-cuBLAS calls without problems in the recent past. — Vitality, Jun 09 '14 at 13:55
@JackOLantern do you want to provide an answer? I agree with your statements and would upvote. — Robert Crovella, Jun 09 '14 at 14:05

score 8 · Accepted Answer · answered Jun 09 '14 at 15:15

A comment by Massimiliano Fatica in a thread on the NVIDIA Forum confirms what I said in the comment above (more precisely, my comment came from remembering that post). In particular:

cublasSetVector, cublasGetVector, cublasSetMatrix, cublasGetMatrix are thin wrappers around cudaMemcpy and cudaMemcpy2D. Therefore, no significant performance differences are expected between the two sets of copy functions.

Accordingly, you can safely pass any array created by cudaMalloc as input to cublasSetVector.

Concerning the strides, there seems to be a misprint in the guide (as of CUDA 6.0), which says

The storage spacing between consecutive elements is given by incx for the source vector x and for the destination vector y.

but should probably read

The storage spacing between consecutive elements is given by incx for the source vector x and incy for the destination vector y.
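
To make the stride semantics concrete, here is a minimal host-only sketch of the copy pattern that reading implies. It does not call cuBLAS or CUDA; `strided_copy_host` and `strided_extent` are hypothetical helpers written for illustration, and the sketch assumes the usual BLAS convention that strides are counted in elements (not bytes) and are positive.

```c
#include <stddef.h>
#include <string.h>

/* Host-only emulation of the access pattern described for cublasSetVector():
 * copy n elements of elem_size bytes from x (stride incx) to y (stride incy).
 * Hypothetical helper, not part of cuBLAS. Strides are in elements and are
 * assumed to be positive. */
static void strided_copy_host(int n, size_t elem_size,
                              const void *x, int incx,
                              void *y, int incy)
{
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;

    for (int i = 0; i < n; ++i) {
        /* Element i lives at index i * inc in each buffer. */
        memcpy(dst + (size_t)i * (size_t)incy * elem_size,
               src + (size_t)i * (size_t)incx * elem_size,
               elem_size);
    }
}

/* Minimum number of element slots a buffer needs to hold n elements
 * spaced inc apart (hypothetical helper). */
static size_t strided_extent(int n, int inc)
{
    return n > 0 ? (size_t)(n - 1) * (size_t)inc + 1 : 0;
}
```

Under these assumptions, incx = incy = 1 is a plain contiguous copy, which is the case where `cudaMemcpy` of `n * elem_size` bytes does the same job. With incx = 2 and incy = 1, every second source element is gathered into a packed destination; with incx = 1 and incy = 2, the destination needs `strided_extent(n, 2)` slots and the slots in between are left untouched.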

The CUBLAS documentation of `cublasSetVector` is missing `incy`, as noted by @JackOLantern. Compare the description of `cublasGetVector` in the immediately following section. I have filed a bug to get the CUBLAS documentation fixed. — njuffa, Jun 09 '14 at 21:18
