
I have a CUDA program:

#include <stdio.h>
#include <cuda.h>
 
__global__ void array_add(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] + 1.0f;
}
 
int main(void)
{
  float *a_h, *a_d;
  const int N = 10;
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);
  cudaMalloc((void **) &a_d, size);
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  array_add <<< n_blocks, block_size >>> (a_d, N);
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  free(a_h); cudaFree(a_d);
}

I did not write this myself; it is a test program, but I do understand what it is supposed to do.

I compiled this on my Linux Mint (Ubuntu 15.04) system with `nvcc cuda.cu` (the file is saved as cuda.cu).

It runs, but the output is the following:

./a.out 
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000

This isn't what I expected to see: 1.0f should have been added to each of these values, but that does not appear to have happened.

What is going wrong here?

CUDA Memcheck Results

Here are the results of running `cuda-memcheck` on the program:

========= CUDA-MEMCHECK
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function " on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ef313]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudart.so.6.5 (cudaLaunch + 0x17e) [0x3686e]
=========     Host Frame:./a.out [0xe02]
=========     Host Frame:./a.out [0xd01]
=========     Host Frame:./a.out [0xd23]
=========     Host Frame:./a.out [0xbbd]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20a40]
=========     Host Frame:./a.out [0x999]
=========
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 4.000000
5 5.000000
6 6.000000
7 7.000000
8 8.000000
9 9.000000
========= ERROR SUMMARY: 1 error

I guess something is wrong with my setup... Any ideas on what it could be, or how to fix it?

GPU Type

I am running this on a GTX 260. My problem was solved by compiling with `nvcc cuda.cu -arch=sm_11`. Thanks to Robert for this.
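
For completeness, the error checking suggested in the comments can be sketched as follows (a minimal macro in the spirit of the canonical answer linked there; `CUDA_CHECK` is an illustrative name, and this needs a CUDA-capable machine to actually run). With this in place, the failed launch reports invalid device function immediately instead of silently leaving a_h unchanged:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Minimal error-checking macro (illustrative name, not canonical). */
#define CUDA_CHECK(call)                                               \
  do {                                                                 \
    cudaError_t err = (call);                                          \
    if (err != cudaSuccess) {                                          \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                     \
              cudaGetErrorString(err), __FILE__, __LINE__);            \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

/* A kernel launch returns no status itself, so check afterwards:
 *
 *   array_add <<< n_blocks, block_size >>> (a_d, N);
 *   CUDA_CHECK(cudaGetLastError());       // launch errors, e.g. invalid
 *                                         // device function
 *   CUDA_CHECK(cudaDeviceSynchronize());  // errors during kernel execution
 */
```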

  • `array_add <<< n_blocks, block_size >>> (a_d, N);` slightly confuses me. What is that? – cadaniluk Oct 06 '15 at 17:36
  • You should add some error checking - currently you don't even know if your kernel has launched or not (I would guess not, since the code looks OK apart from the lack of error checking). – Paul R Oct 06 '15 at 17:36
  • @cad This is CUDA-specific syntax, used for launching a kernel on a device, with a given number of blocks composed of a given number of threads. – user703016 Oct 06 '15 at 17:37
  • @PaulR How can I do this? – FreelanceConsultant Oct 06 '15 at 17:37
  • Your code runs correctly for me. So it's likely a machine setup issue. 1. Follow the verification steps in [the linux getting started guide.](http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#abstract) (e.g. try running a cuda sample code, such as `deviceQuery`) 2. Add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). 3. Run your code with `cuda-memcheck`. – Robert Crovella Oct 06 '15 at 17:37
  • @RobertCrovella How do I do these last 2 things you suggest - is `cuda-memcheck` an `nvcc` flag? – FreelanceConsultant Oct 06 '15 at 17:38
  • `cuda-memcheck ./a.out` I've edited the previous comment with instructive links. – Robert Crovella Oct 06 '15 at 17:39
  • @RobertCrovella Thanks, I've updated the question – FreelanceConsultant Oct 06 '15 at 17:41
  • Did you resolve [your previous question](http://stackoverflow.com/questions/32932005/nvidia-cuda-running-test-program-devicequery-fails)? If you can't run deviceQuery then you're not going to get very far... – Paul R Oct 06 '15 at 17:43
  • You are using CUDA 6.5, which does support GPUs of compute capability 1.x, but does not compile for them by default. You need to update your question with which GPU exactly you are running this on. In the meantime, compiling with `nvcc -arch=sm_11 cuda.cu` will likely give better results. Invalid device function usually means you have compiled for a GPU target that is not supported by your actual GPU, such as compiling for a cc2.0 target when your GPU is a cc1.1 GPU. – Robert Crovella Oct 06 '15 at 17:43
  • @RobertCrovella Thanks very much, this solved my problem. – FreelanceConsultant Oct 06 '15 at 18:26
  • This question (compiling under CUDA 6.5 for a cc1.x GPU) has come up before, I'm going to mark this as a duplicate of the others if no objection. GTX 260 is a [cc1.3 GPU](https://developer.nvidia.com/cuda-legacy-gpus), so compiling for `-arch=sm_11` will work for this code; you can also compile for `-arch=sm_13`. – Robert Crovella Oct 06 '15 at 18:44

0 Answers