
I am trying to perform many cuBLAS operations (e.g., matrix-matrix multiplications) in parallel by using a different cuBLAS handle for each operation and associating a different stream with each handle, but I am getting a very strange segmentation fault. Say we need to perform the matrix-matrix operation C = A*A, where A is an M-by-M matrix (M = 1024), and we need to perform N_STREAMS such computations in parallel. The data for the matrices A are packed into a single large array of length M*M*N_STREAMS. This is the code I wrote to carry out the parallel multiplications:

#define M 1024
#define N_STREAMS 1
int main(void) {
    /* Declarations */
    cublasHandle_t handles[N_STREAMS];
    curandGenerator_t cuda_rng;
    ptrdiff_t i;
    double *d_data = NULL, *d_C = NULL; // device data
    double alpha = 1.0, beta = 0.0, h_C[M * M * N_STREAMS]; //host data

    for (i = 0; i < M * M * N_STREAMS; i++) {
        h_C[i] = 0.0;
    }

    /* Allocate space on the device for the random data `d_data` : */
    CUDA_CALL(cudaMalloc((void**) &d_data, sizeof(double) * M * M * N_STREAMS));
    CUDA_CALL(cudaMalloc((void**) &d_C, sizeof(double) * M * M * N_STREAMS));
    CUDA_CALL( cudaMemcpy(d_C, h_C, M*M*N_STREAMS*sizeof(double), cudaMemcpyHostToDevice));

    /* Create N_STREAMS cuBLAS handles (one per stream) */
    for (i = 0; i < N_STREAMS; i++) { CUBLAS_CALL( cublasCreate( &handles[i])); }

    /* Construct the RNG */
    CURAND_CALL( curandCreateGenerator(&cuda_rng, CURAND_RNG_PSEUDO_DEFAULT));

    /* Generate the random data */
    printf("Generating data. Size = %d\n", M * M * N_STREAMS);
    CURAND_CALL(
            curandGenerateUniformDouble(cuda_rng, d_data, M * M * N_STREAMS));
    for (i = 0; i < N_STREAMS; i++) {
        CUBLAS_CALL(
                cublasDgemm(handles[i], CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                        &alpha, d_data + i*M*M, M, d_data+i*M*M, M, &beta, d_C+i*M*M, M));
    }

    CUDA_CALL( cudaMemcpy(h_C, d_C, M * M * N_STREAMS * sizeof(double), cudaMemcpyDeviceToHost));
    printf("%g\n\n", h_C[1]);

    for (i = 0; i < N_STREAMS; i++) {
        CUBLAS_CALL( cublasDestroy( handles[i]));
    }
    CURAND_CALL( curandDestroyGenerator(cuda_rng));
    CUDA_CALL( cudaFree(d_data));
    CUDA_CALL( cudaFree(d_C));
}

The macros CUDA_CALL, CURAND_CALL and CUBLAS_CALL serve merely to report when something goes wrong and to interrupt execution (find the code here). With M set to 1024, the program fails even with N_STREAMS equal to 1: it gives Segmentation fault (core dumped) with no further information!
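
For reference, the macros look roughly like the following. This is only a simplified sketch of their general shape (print the error and abort); the exact definitions are the ones in the linked code.

#define CUDA_CALL(x) do { cudaError_t e = (x); if (e != cudaSuccess) { \
        printf("CUDA error %d at %s:%d\n", (int) e, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); } } while (0)

#define CUBLAS_CALL(x) do { cublasStatus_t s = (x); if (s != CUBLAS_STATUS_SUCCESS) { \
        printf("cuBLAS error %d at %s:%d\n", (int) s, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); } } while (0)

#define CURAND_CALL(x) do { curandStatus_t s = (x); if (s != CURAND_STATUS_SUCCESS) { \
        printf("cuRAND error %d at %s:%d\n", (int) s, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); } } while (0)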

I don't think the size is the problem (note that I have set N_STREAMS to 1 here, so it is like using a single stream/handle). I have successfully run cuBLAS matrix-matrix multiplications for dimensions of 5000-by-5000 and higher. In the above example, if we decrease M (e.g., #define M 50), it works...

Update: After the comments I received, I changed the stack allocation of h_C to dynamic allocation and now it works. The whole source can be found on GitHub. Any further comments are welcome.
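
A minimal sketch of the change (only the lines that differ from the code above; the complete working version is the one on GitHub):

    /* Allocate h_C on the heap instead of declaring it as a local array */
    double *h_C = (double*) malloc(sizeof(double) * M * M * N_STREAMS);
    if (h_C == NULL) { return EXIT_FAILURE; }

    /* ... the rest of the code is unchanged and uses h_C exactly as before ... */

    free(h_C);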

Pantelis Sopasakis
    What happens if you dynamically allocate h_C? Are you sure you are not exceeding runtime stack size? – talonmies Apr 08 '14 at 16:43
    Why not include the necessary `#include` and `#define` statements needed to make this a complete code? How does stripping those few lines out make this question more useful? Anyway, after wrapping your code with a few extra needed lines, I was able to compile and run it with no problems. So I think the stack allocation issue suggested by @talonmies is the best guess (which has nothing to do with CUDA). Stack behavior can vary from OS to OS. – Robert Crovella Apr 08 '14 at 18:12
  • @talonmies @RobertCrovella Thanks for the answers. I allocated h_C dynamically and now it works (I'll paste the complete working example here tomorrow). The question now is: why can I not use static allocation like `h_C[M * M * N_STREAMS]`? – Pantelis Sopasakis Apr 08 '14 at 23:09
    Because allocating [large variables on the stack is problematic](http://stackoverflow.com/questions/79923/what-and-where-are-the-stack-and-heap). It has nothing to do with CUDA. – Robert Crovella Apr 09 '14 at 01:00
  • @RobertCrovella So I ran into a kind of stack overflow? What defines an array as large? (I guess a rule of thumb would be to allocate memory considerably smaller than the stack size.) – Pantelis Sopasakis Apr 09 '14 at 09:14
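
For the record, on the stack-size rule of thumb above: h_C here occupies 1024 * 1024 * 8 bytes = 8 MiB, which already equals the common 8 MiB default soft stack limit on Linux, so the local array alone exhausts the stack. The current limit can be inspected with the POSIX getrlimit call; a minimal sketch (assuming a Linux/POSIX system):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        /* rlim_cur is the soft limit in bytes (RLIM_INFINITY if unlimited) */
        printf("Stack soft limit: %llu bytes\n", (unsigned long long) rl.rlim_cur);
    }
    return 0;
}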
