
I am learning how to code with CUDA on multiple GPUs. The compute capability of my devices is 4.0, so I understand that I can program multiple GPUs from a single host thread. To start with, I referred to the SDK example "simpleMultiGPU.cpp" and ran it on one and on two GPUs. I find that the performance in both cases is very similar, i.e., running the code on two GPUs is as slow/fast as running it on one GPU. I understand that it is not optimized code, but it is the example I have in front of me to start with, and it isn't scaling. I think the code on the two devices is running serially.

Has anyone experienced the same problem while running this SDK example?

I wrote a simple vector addition code based on this SDK example, and it performs in a similar way (as expected). I am using asynchronous CUDA calls and pinned host memory, and I am trying to understand the reason behind this behavior.
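
In outline, the per-device pattern I am following is the one below. This is just a stripped-down sketch of the intended overlap (the names d_a, h_a, d_b, h_b, d_c, h_c, chunkBytes, chunkN, blocks and threadsPerBlock are placeholders, not the variables in my actual code); the complete code is pasted further down.

    // Sketch only: one stream per device; the copies and the kernel launch are issued
    // asynchronously on that device's stream, and synchronization is deferred to a second loop.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaMemcpyAsync(d_a[dev], h_a[dev], chunkBytes[dev], cudaMemcpyHostToDevice, stream[dev]);
        cudaMemcpyAsync(d_b[dev], h_b[dev], chunkBytes[dev], cudaMemcpyHostToDevice, stream[dev]);
        vecadd_kernel<<<blocks[dev], threadsPerBlock, 0, stream[dev]>>>(d_a[dev], chunkN[dev], d_b[dev], d_c[dev]);
        cudaMemcpyAsync(h_c[dev], d_c[dev], chunkBytes[dev], cudaMemcpyDeviceToHost, stream[dev]);
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
    }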

Any insight will be highly appreciated.

Here is a copy of the main code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#include <cuda_runtime.h>
#include <helper_cuda.h>    // CUDA SDK helper header: checkCudaErrors(), getLastCudaError()

typedef struct {
    float* vec;
    int N;
} vector;

extern "C" {

//Define kernel for vector addition
__global__ void vecadd_kernel(float *avec, int N, float* bvec, float *cvec){

    int tId=blockIdx.x*blockDim.x+threadIdx.x;

    if(tId < N)
        cvec[tId]=avec[tId]+bvec[tId];
}

void launch_addvec_kernel(float *avec, int N, float* bvec, float *cvec, int THREAD_N, int BLOCK_N, cudaStream_t &s){

    vecadd_kernel<<< BLOCK_N, THREAD_N, 0, s >>> (avec,N,bvec,cvec);
    getLastCudaError("vecadd_kernel() execution failed.\n");
}


}

int main(){

clock_t lapse;
float cpu_time;
lapse=clock();
vector avec, bvec, cvec, cvec_gpu;

int N=256*256*256;
int threads=256;

avec.N=N; 
bvec.N=avec.N; 
cvec.N=avec.N; 

avec.vec=(float*)malloc(sizeof(float)*avec.N);
bvec.vec=(float*)malloc(sizeof(float)*bvec.N);
cvec.vec=(float*)malloc(sizeof(float)*cvec.N);
cvec_gpu.vec=(float*)malloc(sizeof(float)*avec.N);

for(int i=0;i<avec.N;++i){
    avec.vec[i]=i;
    bvec.vec[i]=i;
}


//Normal CPU addition
#pragma unroll
for(int i=0;i<avec.N;++i){
    cvec.vec[i]=avec.vec[i]+bvec.vec[i];
}

cpu_time=clock()-lapse;

printf("CPU execution time = %f seconds \n",cpu_time/CLOCKS_PER_SEC);

//-------------------- Multi-GPU code -------------------------------
//-------------------- Multi-GPU code -------------------------------
//-------------------- Multi-GPU code -------------------------------

//Get number of CUDA enabled devices
lapse=clock();
int deviceCount;
cudaGetDeviceCount(&deviceCount);
//deviceCount=1;

vector apartvecs[deviceCount], bpartvecs[deviceCount], cpartvecs[deviceCount];
vector apartvecs_gpu[deviceCount], bpartvecs_gpu[deviceCount],  cpartvecs_gpu[deviceCount];

int i,j;

    //Subdividing input data across GPUs
    //Get data sizes for each GPU
for (i=0; i<deviceCount; ++i)
    apartvecs[i].N = N/deviceCount;

    //Take into account "odd" data sizes
for (i=0; i<N%deviceCount; ++i)
    ++apartvecs[i].N; 
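    //Example of the split: with N=10 and 2 GPUs each device gets 5 elements;
    //with N=11 the remainder loop above gives device 0 one extra element (6 and 5).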


int offset[deviceCount];

    //Starting index of each device's chunk within the full vectors
offset[0]=0;
for (i=1; i<deviceCount; ++i)
    offset[i]=offset[i-1]+apartvecs[i-1].N;

cudaStream_t stream[deviceCount];

    //Create streams for issuing GPU commands asynchronously and allocate memory (GPU and system page-locked)
for (i=0; i<deviceCount; ++i){
    checkCudaErrors( cudaSetDevice(i) );
    checkCudaErrors( cudaStreamCreate(&stream[i]) );

    cpartvecs[i].vec=(float*)malloc(sizeof(float)*apartvecs[i].N);
    memset(cpartvecs[i].vec,'\0',sizeof(float)*apartvecs[i].N);

    //Allocate device memory
    checkCudaErrors( cudaMalloc((void**)&apartvecs_gpu[i].vec, apartvecs[i].N * sizeof(float)) );
    checkCudaErrors( cudaMalloc((void**)&bpartvecs_gpu[i].vec, apartvecs[i].N * sizeof(float)) );
    checkCudaErrors( cudaMalloc((void**)&cpartvecs_gpu[i].vec, apartvecs[i].N * sizeof(float)) );

    //Allocate pinned memory on host
    checkCudaErrors( cudaMallocHost((void**)&apartvecs[i].vec, apartvecs[i].N * sizeof(float)) );
    checkCudaErrors( cudaMallocHost((void**)&bpartvecs[i].vec, apartvecs[i].N * sizeof(float)) );

    //Copy this device's chunk of the input vectors into the pinned buffers
    for (j=0;j<apartvecs[i].N;++j){
        int j1=j+offset[i];
        apartvecs[i].vec[j]=avec.vec[j1];
        bpartvecs[i].vec[j]=bvec.vec[j1];
        //printf("%d \t %d \t %d \t %d \t %f\n",i,j,offset[i],j1,apartvecs[i].vec[j]);
    }
}

    //Copy data to GPU, launch the kernel and copy data back, all asynchronously
for (i=0; i<deviceCount; ++i){

    //Set device
    checkCudaErrors( cudaSetDevice(i) );

    //Copy input data from CPU
    checkCudaErrors( cudaMemcpyAsync(apartvecs_gpu[i].vec, apartvecs[i].vec, apartvecs[i].N * sizeof(float), cudaMemcpyHostToDevice, stream[i]) );
    checkCudaErrors( cudaMemcpyAsync(bpartvecs_gpu[i].vec, bpartvecs[i].vec, apartvecs[i].N * sizeof(float), cudaMemcpyHostToDevice, stream[i]) );

    //Enough blocks to cover this device's chunk (rounded up)
    int numblocks = (apartvecs[i].N + threads - 1)/threads;

    //printf("before kernel %d \n",apartvecs[i].N);
    launch_addvec_kernel(apartvecs_gpu[i].vec,apartvecs[i].N,bpartvecs_gpu[i].vec,cpartvecs_gpu[i].vec,threads,numblocks,stream[i]);

    //Read back GPU results
    checkCudaErrors( cudaMemcpyAsync(cpartvecs[i].vec, cpartvecs_gpu[i].vec, apartvecs[i].N * sizeof(float), cudaMemcpyDeviceToHost, stream[i]) );
}

    //Process GPU results
for(i = 0; i < deviceCount; i++){

    //Set device
    checkCudaErrors( cudaSetDevice(i) );

    //Wait for all operations on this device's stream to finish
    checkCudaErrors( cudaStreamSynchronize(stream[i]) );
    // cudaDeviceSynchronize();

    //Gather this device's chunk into the full result vector
    for(int j=0; j<apartvecs[i].N; ++j){
        int j1=j+offset[i];
        cvec_gpu.vec[j1]=cpartvecs[i].vec[j];
        //printf("%d \t %d \t %d \t %d \t %f\n",i,j,offset[i],j1,cvec_gpu.vec[j1]);
    }

    //Shut down this GPU
    checkCudaErrors( cudaFreeHost(apartvecs[i].vec) );
    checkCudaErrors( cudaFreeHost(bpartvecs[i].vec) );
    checkCudaErrors( cudaFree(apartvecs_gpu[i].vec) );
    checkCudaErrors( cudaFree(bpartvecs_gpu[i].vec) );
    checkCudaErrors( cudaFree(cpartvecs_gpu[i].vec) );
    checkCudaErrors( cudaStreamDestroy(stream[i]) );
    free(cpartvecs[i].vec);
}

free(avec.vec);
free(bvec.vec);
free(cvec.vec);
free(cvec_gpu.vec);
cpu_time=clock()-lapse;

printf("GPU execution time = %f seconds \n",cpu_time/CLOCKS_PER_SEC);

}

shadowfax
  • show us how you're executing your kernel on two GPU devices. – dthorpe Nov 09 '12 at 00:16
  • could you fix the formatting on your code please? AFAIK there are no "compute capability 4.0" devices. Probably you mean your CUDA version? Could you briefly describe your multi-GPU setup? What system are you using and what are the 2 (or more) GPUs installed? A simple vector add is probably not a great code choice for trying to exercise 2 GPUs in parallel. Much of the time spent will be in the cudaMemcpy operations. Could you provide some specific results as well as what your expected results are? – Robert Crovella Nov 09 '12 at 00:43
  • can you also provide a complete, compilable example? with #include statements and every other line of code needed to build and test your app? – Robert Crovella Nov 09 '12 at 00:44
  • [possible duplicate](http://stackoverflow.com/questions/10796634/multiple-gpus-on-one-device-no-speedup-in-simplemultigpu-example) You may also be interested in [this SO question](http://stackoverflow.com/questions/10529972/multi-gpu-basic-usage). – Robert Crovella Nov 09 '12 at 01:06
  • Thanks Robert for this link. It is the same problem that I am facing. I am going to work on the suggestions made in that thread. thanks! – shadowfax Nov 09 '12 at 17:03

0 Answers