I am trying to perform many cuBLAS operations (e.g., matrix-matrix multiplications) in parallel using different cuBLAS handles and associating a different stream with each handle, but I am getting a very strange Segmentation fault. Let's say we need to perform the matrix-matrix operation C = A*A
where A
is an M-by-M matrix (M=1024) and assume we need to perform N_STREAM
such computations in parallel. The data for the matrices A
are packed in a large array of dimension M*M*N_STREAM
. This is the code I wrote to carry out the parallel multiplication:
#define M 1024
#define N_STREAMS 1
int main(void) {
/* Declarations */
cublasHandle_t handles[N_STREAMS];
curandGenerator_t cuda_rng;
ptrdiff_t i;
double *d_data = NULL, *d_C = NULL; // device data
double alpha = 1.0, beta = 0.0, h_C[M * M * N_STREAMS]; //host data
for (i = 0; i < M * M * N_STREAMS; i++) {
h_C[i] = 0.0;
}
/* Allocate space on the device for the random data `d_data` : */
CUDA_CALL(cudaMalloc((void**) &d_data, sizeof(double) * M * M * N_STREAMS));
CUDA_CALL(cudaMalloc((void**) &d_C, sizeof(double) * M * M * N_STREAMS));
CUDA_CALL( cudaMemcpy(d_C, h_C, M*M*N_STREAMS*sizeof(double), cudaMemcpyHostToDevice));
/* Start N_STREAMS streams */
for (i = 0; i < N_STREAMS; i++) { CUBLAS_CALL( cublasCreate( &handles[i])); }
/* Start the RNG */
/* Construct the RNG */
CURAND_CALL( curandCreateGenerator(&cuda_rng, CURAND_RNG_PSEUDO_DEFAULT));
CURAND_CALL( curandGenerateUniformDouble(cuda_rng,d_data, M*M*N_STREAMS));
/* Load the random data */
printf("Generating data. Size = %d\n", M * M * N_STREAMS);
CURAND_CALL(
curandGenerateUniformDouble(cuda_rng, d_data, M * M * N_STREAMS));
for (i = 0; i < N_STREAMS; i++) {
CUBLAS_CALL(
cublasDgemm(handles[i], CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
&alpha, d_data + i*M*M, M, d_data+i*M*M, M, &beta, d_C+i*M*M, M));
}
CUDA_CALL( cudaMemcpy(h_C, d_C, M * M * N_STREAMS * sizeof(double), cudaMemcpyDeviceToHost));
printf("%g\n\n", h_C[1]);
for (i = 0; i < N_STREAMS; i++) {
CUBLAS_CALL( cublasDestroy( handles[i]));
}
CURAND_CALL( curandDestroyGenerator(cuda_rng));
CUDA_CALL( cudaFree(d_data));
CUDA_CALL( cudaFree(d_C));
}
The functions CUDA_CALL
, CURAND_CALL
and CUBLAS_CALL
serve merely the purpose of informing me when something goes wrong and interrupting the execution (find the code here). I set M to 1024, but even with N_STREAMS equal to 1, it fails (It gives a Segmentation fault (core dumped)
with no more information)!
I don't think that the size is such a problem (notice here that I have set N_STREAMS to 1, so it
s like using a single stream/handle). I have successfully tried matrix-matrix multiplications with cuBLAS for dimensions 5000-by-5000 and higher. In the above example, if we decrease M (e.g., #define M 50
), it works...
Update
After the comments I received, I changed the stack allocation to dynamic allocation for h_C
and now it works. The whole source can be found on github. Any further comments are welcome.