/* Include libraries */
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>


int main(void)
{
    int a = 1;
    int c = 2;
    int *deva, *devc;

    cudaMalloc( (void**)&deva, sizeof(int) );
    cudaMalloc( (void**)&devc, sizeof(int) );

    cudaMemcpy( deva, &a, sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( devc, &c, sizeof(int), cudaMemcpyHostToDevice );

    cudaMemcpy( &c, deva, sizeof(int), cudaMemcpyDeviceToHost );
    cudaMemcpy( &a, devc, sizeof(int), cudaMemcpyDeviceToHost );

    printf("\n%d %d\n", a, c); // Output (should be "2 1")

    cudaFree( deva );
    cudaFree( devc );

    return 0;
}

This simple code should swap a=1 and c=2 and print "2 1", but it does nothing. And this isn't the only "simple" textbook example that fails for me. For instance, why do all the textbooks say I can declare a and c as pointers and fill in their values later, when the program won't compile if I do that? What am I overlooking here?

Aziz Shaikh
  • You have no kernel to run. – Mihai Maruseac Dec 10 '13 at 02:08
  • The code does work for me. – Sagar Masuti Dec 10 '13 at 02:11
  • You can do the error checking as mentioned in [here](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) Most likely you will find the problem yourself. – Sagar Masuti Dec 10 '13 at 02:22
  • Mihai Maruseac I don't need a kernel to simply swap values, do I? Sagar Masuti Does it work for you without you having changed anything? – user3085127 Dec 10 '13 at 02:30
  • It does work for me without any changes. [Check this](http://pastebin.com/5rY5rfRN) – Sagar Masuti Dec 10 '13 at 02:40
  • I found these threads: [link](http://stackoverflow.com/questions/15178285/wrong-results-in-cudamemcpydevicetohost?rq=1) and [link](http://stackoverflow.com/questions/15177019/cuda-program-does-not-give-the-correct-output-when-using-a-cuda-compatible-gpu) those might explain some things. I am using CUDA 5.0 while the remote computer I compile on has not been updated in 2 years (before CUDA 5.0). Can't believe I wasted 6 hours thinking I made some crucial mistake somewhere... – user3085127 Dec 10 '13 at 02:45
  • The code works also for me. Which architecture are you targeting? What is the compilation string? – Vitality Dec 10 '13 at 06:10
  • My .cu file is called 4.cu, I compile with "nvcc 4.cu -o 4", then run with "./4", nothing more. I don't know the type of the graphic card of the computer because I access it through a university server, it's not mine (all I know is it's at least 2 years old and probably a quadro, there is a possibility the drivers on the card are outdated and that my CUDA version is too recent but then again my script is as basic as it gets). – user3085127 Dec 10 '13 at 15:44
  • As already suggested, if you add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code, you'll have a much better idea of what is going wrong, even when you are running it remotely. – Robert Crovella Dec 10 '13 at 16:44
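The error checking the comments link to amounts to wrapping every runtime call and aborting on failure. A minimal sketch of that pattern (the macro name here is illustrative, not from the linked answer):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and exit on failure.
#define gpuErrchk(call)                                              \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage in the question's code, e.g.:
//   gpuErrchk( cudaMalloc((void**)&deva, sizeof(int)) );
//   gpuErrchk( cudaMemcpy(deva, &a, sizeof(int), cudaMemcpyHostToDevice) );
```

With this in place, a missing or inaccessible device shows up immediately on the first wrapped call instead of failing silently.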

2 Answers


Add a kernel function with a `__global__` declaration:

__global__ void swap(int *a, int *c)
{
    // Swap using the XOR operator
    *a = *a ^ *c;
    *c = *a ^ *c;
    *a = *a ^ *c;
}

Call the kernel from the main function as `swap<<<1,1>>>(deva, devc);` (the kernel takes two pointers, and swapping a single value only needs one block with one thread).
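Putting the kernel together with the host code from the question, a minimal complete version might look like this (a sketch, not tested on your setup; error checking omitted for brevity):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void swap(int *a, int *c)
{
    // XOR swap on device memory
    *a = *a ^ *c;
    *c = *a ^ *c;
    *a = *a ^ *c;
}

int main(void)
{
    int a = 1, c = 2;
    int *deva, *devc;

    cudaMalloc((void**)&deva, sizeof(int));
    cudaMalloc((void**)&devc, sizeof(int));

    cudaMemcpy(deva, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(devc, &c, sizeof(int), cudaMemcpyHostToDevice);

    swap<<<1, 1>>>(deva, devc);   // one block, one thread

    cudaMemcpy(&a, deva, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&c, devc, sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d %d\n", a, c);      // expect "2 1"

    cudaFree(deva);
    cudaFree(devc);
    return 0;
}
```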

For further reference, see http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Arun
  • `__global__ void swap(void)` ? Did you mean `__global__ void swap(int *a, int *c)` ? – Sagar Masuti Dec 11 '13 at 01:29
  • @SagarMasuti i changed it...its actually void swap(int *a,int *c)...Thank u! – Arun Dec 11 '13 at 09:39
  • There are many problems with this code. The swapping method is a bad idea. The number of reads and writes to global memory is significantly higher than a kernel that makes use of a single temporary storage value, and so it will run more slowly. The kernel, as written, only knows how to swap a single value, and is missing many aspects of a normal cuda kernel. Furthermore, calling kernels with `<<<N,1>>>` parameters is a *bad* way to structure CUDA code. It appears in the linked presentation, but is for education only, as students are being introduced to the concepts of blocks and threads. – Robert Crovella Dec 12 '13 at 19:56
  • @RobertCrovella: Thank you for the comments! Will try to solve the overhead on execution. And, in CUDA we actually call the kernel with `<<<...>>>(parameters)`, right? How to structure it in a good way? – Arun Dec 13 '13 at 02:04

Ok, I applied this to the parallel parts of my code and the result is a single error message: "no CUDA-capable device is detected". So I guess something in the network isn't working properly or I don't have all the permissions.

UPDATE: the network administrator changed some settings and everything works now.

Community