Basic CUDA - getting kernels to run on the device using C++

Question

I'm new to CUDA and am trying to get a basic kernel to run on the device. I have compiled and run the examples, so I know the device drivers work and CUDA can run successfully. My goal is to have my C++ code call CUDA to greatly speed up a task. I've been reading a number of posts online about how to do this, specifically this one: "Can I call CUDA runtime function from C++ code not compiled by nvcc?"

My question is very simple (embarrassingly so): when I compile and run my code (posted below) I get no errors, but the kernel does not appear to run. This should be trivial to fix, but after 6 hours I'm at a loss. I'd post this on the NVIDIA forums but they're still down :/. I'm sure the answer is very basic - any help? Below are my code, how I compile it, and the terminal output I see:

main.cpp

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern void kernel_wrapper(int *a, int *b);

int main(int argc, char *argv[]){
  int a = 2;
  int b = 3;

  printf("Input: a = %d, b = %d\n",a,b);
  kernel_wrapper(&a, &b);
  printf("Ran: a = %d, b = %d\n",a,b);
  return 0;
}

kernel.cu

#include "cuPrintf.cu"
#include <stdio.h>
__global__ void kernel(int *a, int *b){
  int tx = threadIdx.x;
  cuPrintf("tx = %d\n", tx);
  switch( tx ){
    case 0:
      *a = *a + 10;
      break;
    case 1:
      *b = *b + 3;
      break;
    default:
      break;
  }
}

void kernel_wrapper(int *a, int *b){
  cudaPrintfInit();
  //cuPrintf("Anything...?");
  printf("Anything...?\n");
  int *d_1, *d_2;
  dim3 threads( 2, 1 );
  dim3 blocks( 1, 1 );

  cudaMalloc( (void **)&d_1, sizeof(int) );
  cudaMalloc( (void **)&d_2, sizeof(int) );

  cudaMemcpy( d_1, a, sizeof(int), cudaMemcpyHostToDevice );
  cudaMemcpy( d_2, b, sizeof(int), cudaMemcpyHostToDevice );

  kernel<<< blocks, threads >>>( a, b );
  cudaMemcpy( a, d_1, sizeof(int), cudaMemcpyDeviceToHost );
  cudaMemcpy( b, d_2, sizeof(int), cudaMemcpyDeviceToHost );
  printf("Output: a = %d\n", a[0]);
  cudaFree(d_1);
  cudaFree(d_2);

  cudaPrintfDisplay(stdout, true);
  cudaPrintfEnd();
}

I compile the above code from the terminal using the commands:

g++ -c main.cpp
nvcc -c kernel.cu -I/home/clj/NVIDIA_GPU_Computing_SDK/C/src/simplePrintf
nvcc -o main main.o kernel.o

When I run the code I get the following terminal output:

$./main
Input: a = 2, b = 3
Anything...?
Output: a = 2
Ran: a = 2, b = 3

It's clear that main.cpp is being compiled correctly and is calling the code in kernel.cu. The obvious problem is that the kernel does not appear to run: a and b come back unchanged and none of the cuPrintf output shows up. I'm sure the answer to this is basic - VERY VERY BASIC. But I don't know what's happening - help please?

You should really check if any of the calls you make returns an error. — Bart, Jul 20 '12 at 21:38
Synchronize after the kernel call would be my guess, but Bart is also correct in any case. — ergosys, Jul 21 '12 at 02:36
@ergosys: the cudaMemcpy calls are blocking and will cause synchronization. — talonmies, Jul 21 '12 at 05:59
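
To make the error-checking suggestion concrete, here is a minimal sketch of wrapping every runtime call and checking the kernel launch. CUDA_CHECK and add_ten are hypothetical names chosen for illustration and are not part of the code in the question; the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard. Treat it as one common pattern under those assumptions rather than the only way to do it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line and a readable message
// whenever a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
              __FILE__, __LINE__, cudaGetErrorString(err_));        \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// Hypothetical kernel used only to have something to launch.
__global__ void add_ten(int *value) {
  *value += 10;
}

int main() {
  int host_value = 2;
  int *dev_value = NULL;

  CUDA_CHECK(cudaMalloc((void **)&dev_value, sizeof(int)));
  CUDA_CHECK(cudaMemcpy(dev_value, &host_value, sizeof(int),
                        cudaMemcpyHostToDevice));

  add_ten<<<1, 1>>>(dev_value);
  CUDA_CHECK(cudaGetLastError());       // reports errors from the launch itself
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

  CUDA_CHECK(cudaMemcpy(&host_value, dev_value, sizeof(int),
                        cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(dev_value));

  printf("host_value = %d\n", host_value);
  return 0;
}
```

A kernel launch returns nothing, so its errors only become visible through a later runtime call. Checking the calls that follow the launch in kernel_wrapper in the same way would be expected to surface a failure instead of silently printing the unchanged values.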

score 3 · Answer 1 · answered Jul 21 '12 at 19:47

Inside kernel_wrapper you have the following call:

kernel<<< blocks, threads >>>( a, b );

Here you are passing the kernel pointers to variables that live in host memory. The GPU cannot dereference them: any pointer a kernel reads or writes through has to point to device memory. You already allocate d_1 and d_2 on the device and copy the inputs into them, so passing those instead solves the problem, and the result will be a = 12 and b = 6:

kernel<<< blocks, threads >>>( d_1, d_2 );
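
Putting the fix together, a single-file sketch might look like the following. Two things here are assumptions rather than part of the original code: device-side printf stands in for cuPrintf (this needs a device of compute capability 2.0 or later), and main is folded into the same file so the example builds on its own. The only return code checked is that of the first copy back, since that blocking call is where a kernel failure would show up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same logic as the kernel in the question: thread 0 updates *a,
// thread 1 updates *b. Both pointers must refer to device memory.
__global__ void kernel(int *a, int *b) {
  int tx = threadIdx.x;
  printf("tx = %d\n", tx);  // device-side printf, assumed available
  switch (tx) {
    case 0:
      *a = *a + 10;
      break;
    case 1:
      *b = *b + 3;
      break;
    default:
      break;
  }
}

void kernel_wrapper(int *a, int *b) {
  int *d_1 = NULL, *d_2 = NULL;
  dim3 threads(2, 1);
  dim3 blocks(1, 1);

  cudaMalloc((void **)&d_1, sizeof(int));
  cudaMalloc((void **)&d_2, sizeof(int));

  cudaMemcpy(d_1, a, sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_2, b, sizeof(int), cudaMemcpyHostToDevice);

  // The fix: pass the device pointers, not the host pointers a and b.
  kernel<<<blocks, threads>>>(d_1, d_2);

  // cudaMemcpy blocks until the kernel has finished, so an error
  // raised by the kernel is returned here.
  cudaError_t err = cudaMemcpy(a, d_1, sizeof(int), cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
  }
  cudaMemcpy(b, d_2, sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(d_1);
  cudaFree(d_2);
}

int main() {
  int a = 2;
  int b = 3;

  printf("Input: a = %d, b = %d\n", a, b);
  kernel_wrapper(&a, &b);
  printf("Ran: a = %d, b = %d\n", a, b);  // a = 12, b = 6 with the fix applied
  return 0;
}
```

Saved as kernel.cu, this should build with nvcc -o main kernel.cu. To keep the original two-file layout, drop main from this file and compile and link exactly as in the question.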
