I have written a simple matrix multiplication program using CUDA. When I run it with inputs A (10000 × 10000) and B (10000 × 10000), I receive this message:
cudaDeviceSynchronize returned error code 4 after launching
After adding the following calls to measure the run time, I receive an "unspecified launch failure" error:
cudaEventRecord(start);
// here is my kernel call
cudaEventRecord(stop);
cudaEventSynchronize(stop);
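For context, a fuller version of this timing pattern might look like the sketch below. The kernel `dummyKernel`, the `CHECK` macro, and the array size are placeholders made up for illustration, not the code from this question. The idea is that checking `cudaGetLastError()` right after the launch, and the return value of `cudaEventSynchronize`, should show which call first reports the failure.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the question): abort with a readable
// message as soon as any CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed at %s:%d: %s\n", #call,       \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void dummyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * i;
}

int main()
{
    const int n = 1 << 20;  // arbitrary size for the placeholder kernel
    int *dev_out = nullptr;
    CHECK(cudaMalloc(&dev_out, n * sizeof(int)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    dummyKernel<<<(n + 255) / 256, 256>>>(dev_out, n);
    CHECK(cudaGetLastError());          // catches invalid launch configurations
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));  // surfaces errors raised while the kernel ran

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("kernel time: %.3f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dev_out));
    return 0;
}
```

Because kernel launches are asynchronous, an error that occurs during execution is typically reported by the next synchronizing call, which may explain why the message changes once the event calls are added.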
This is my kernel call:
mulKernel<<<1, dataSet.threadSize>>>(dev_c, dev_a, dev_b, dataSet.n, dataSet.m, dataSet.p, dataSet.threadSize);
And this is my kernel code:
    int i = threadIdx.x;
    int j, k, sum;
    //if(n<=threadSize)
    for (; i < n; i += threadSize) {
        for (j = 0; j < p; j++) {
            sum = 0;
            for (k = 0; k < m; k++) {
                sum += A[i * m + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
How can I fix this error?
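One direction that might be worth trying is spreading the work over many blocks instead of one, so that no single thread loops over whole rows of the result. The sketch below assumes the failure comes from the single-block kernel running too long (for example, hitting a display driver watchdog), which nothing above confirms. The kernel name `mulKernel2D`, the `int` element type, the row-major layout, and the small test sizes are also assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed layout: A is n x m, B is m x p, C is n x p, all row-major ints.
__global__ void mulKernel2D(int *C, const int *A, const int *B,
                            int n, int m, int p)
{
    // Grid-stride loops in both dimensions, so any grid size gives a
    // correct result; each thread handles a few (i, j) output elements.
    for (int i = blockIdx.y * blockDim.y + threadIdx.y; i < n;
         i += gridDim.y * blockDim.y) {
        for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < p;
             j += gridDim.x * blockDim.x) {
            int sum = 0;
            for (int k = 0; k < m; k++)
                sum += A[i * m + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
}

int main()
{
    // Small illustrative sizes, not the 10000 x 10000 case from the question.
    const int n = 64, m = 48, p = 32;
    std::vector<int> a(n * m, 1), b(m * p, 2), c(n * p, 0);

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc(&dev_a, a.size() * sizeof(int));
    cudaMalloc(&dev_b, b.size() * sizeof(int));
    cudaMalloc(&dev_c, c.size() * sizeof(int));
    cudaMemcpy(dev_a, a.data(), a.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), b.size() * sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(16, 16);                        // 256 threads per block
    dim3 grid((p + block.x - 1) / block.x,     // enough blocks to cover columns
              (n + block.y - 1) / block.y);    // and rows of C
    mulKernel2D<<<grid, block>>>(dev_c, dev_a, dev_b, n, m, p);

    cudaError_t err = cudaGetLastError();      // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(c.data(), dev_c, c.size() * sizeof(int), cudaMemcpyDeviceToHost);

    // With A filled with 1 and B filled with 2, every element should be 2 * m.
    bool ok = true;
    for (int v : c) ok = ok && (v == 2 * m);
    printf("%s\n", ok ? "result matches" : "result mismatch");

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return ok ? 0 : 1;
}
```

Even with this layout, a 10000 × 10000 product is a lot of work for a naive kernel, so a GPU that also drives a display could still hit a time limit. Splitting the multiplication into several smaller launches is another option in that case.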