I am exploring CUDA 8.0 with Visual Studio 2015 (running on a GeForce GTX 1060).
I launched a kernel with 2000 blocks of 1024 threads each (both values are within the supported limits), but cudaDeviceSynchronize returns error code 4 after the launch. The kernel is not doing anything exotic; in fact, I'm not even using shared memory. What am I doing wrong?
My code is as follows:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

// Each thread sums the integers 0..99999 into a local variable; nothing is written to memory.
__global__
void addKernel()
{
    unsigned int i, ans = 0;
    for (i = 0; i < 100000; i++)
    {
        ans += i;
    }
}

int main()
{
    // Launch 2000 blocks of 1024 threads each.
    addKernel<<<2000, 1024>>>();

    // Wait for the kernel to finish and pick up any error it raised.
    cudaError_t cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    }

    cudaDeviceReset();
    getchar();
    return 0;
}
Output:
cudaDeviceSynchronize returned error code 4 after launching addKernel!
When I cut the number of blocks in half, the error goes away. Interestingly, reducing the kernel's loop from 100,000 iterations to 1,000 also eliminates the error.
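The bare numeric code is hard to interpret, so a variant like the one below might help narrow things down. It is only a minimal diagnostic sketch: `reportCudaError` is a hypothetical helper I made up for illustration, and I'm assuming device 0 is the GTX 1060. It prints the symbolic name and message for each status, and it also prints the `kernelExecTimeoutEnabled` device property. Because the error depends on how much work the kernel does, a run-time limit on kernels seems worth ruling out, though that is a guess on my part and not something I have confirmed.

```cuda
#include "cuda_runtime.h"
#include <stdio.h>

// Hypothetical helper (not part of the original code): prints the symbolic
// name and message for a CUDA status and returns true on success.
static bool reportCudaError(cudaError_t status, const char *what)
{
    if (status == cudaSuccess) {
        return true;
    }
    fprintf(stderr, "%s failed: %s (%d): %s\n", what,
            cudaGetErrorName(status), (int)status, cudaGetErrorString(status));
    return false;
}

// Same kernel as in the question.
__global__ void addKernel()
{
    unsigned int i, ans = 0;
    for (i = 0; i < 100000; i++) {
        ans += i;
    }
}

int main()
{
    // Assumption: device 0 is the GPU in question.
    cudaDeviceProp prop;
    if (reportCudaError(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) {
        // Nonzero means there is a run-time limit on kernels for this device.
        printf("%s: kernelExecTimeoutEnabled = %d\n", prop.name, prop.kernelExecTimeoutEnabled);
    }

    addKernel<<<2000, 1024>>>();

    // cudaGetLastError reports problems with the launch itself;
    // cudaDeviceSynchronize reports errors raised while the kernel ran.
    reportCudaError(cudaGetLastError(), "kernel launch");
    reportCudaError(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    cudaDeviceReset();
    return 0;
}
```

Checking `cudaGetLastError` separately from `cudaDeviceSynchronize` should at least show whether the launch configuration is rejected up front or the failure happens while the kernel is executing.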