
I am struggling with parallelizing a ray-tracing program using CUDA. I have the sequential code, and I have written the parallel code (the kernel).

When building the program, I encounter the following error (copied from VS2010):

Error   1   error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin\nvcc.exe" -gencode=arch=compute_21,code=\"sm_21,compute_21\" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "C:\Program Files\Microsoft Visual Studio 10.0\VC\bin"  -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\include"    --keep-dir "Release" -maxrregcount=0  --machine 32 --compile      -Xcompiler "/EHsc /nologo /Od /Zi  /MD  " -o "Release\CUDAraytracer.cu.obj" "c:\Users\mc.choice\Desktop\CUDAraytracer.cu"" exited with code -1.  C:\Program Files\MSBuild\Microsoft.Cpp\v4.0\BuildCustomizations\CUDA 4.2.targets    361

I think I have all libs and headers included correctly.

Any ideas on how to compile & run it successfully, and what the cause of the error might be?

Thanks in advance

ChoiceMC
  • There may not be enough here to go on. Can you provide more of the output? Such as the lines before and after the one you posted? Can you provide the program `CUDAraytracer.cu` ? Someone else could do a test compile and see if there are any issues. Is there some reason you're using CUDA 4.2? It's pretty old now. Are you able to build any of the cuda sample applications? – Robert Crovella Feb 06 '14 at 17:43
  • Hi. This is the only output I get. I'm using CUDA 4.2 because of my old graphics card - 9500GS, driver version 306.23. Here is the CUDAraytracer.cu (too long to post, so here's the link): "http://codeshare.io/RGw1e". And yeah, I can get some of the sample applications to run, even though I have to restart my PC every time I run one, due to screen flashing and freezing, lol... Tnx. – ChoiceMC Feb 06 '14 at 18:03
  • Some of your VS settings may be preventing you from seeing the actual output from `nvcc`. Can you try issuing the command from the command prompt? Basically copy everything starting with `"C:\Program Files...` and ending with `...CUDAraytracer.cu"` and enter it in as a command at the command prompt – Robert Crovella Feb 06 '14 at 18:17
  • Interesting... "Unsupported gpu architecture 'compute_21'". Why is that? I changed it in Project settings / CUDA C/C++ / Device... Originally, it was 'compute_10', but with that setting I had the following error: Error 20 error : Recursive function call is not supported yet: calculateReflection(double*, double*, double*, RAY, int, SPHERE_INTERSECTION, double, double, double, int) c:\Users\mc.choice\Desktop\CUDAraytracer.cu 529 What to do? – ChoiceMC Feb 06 '14 at 18:43
  • Change the project settings and/or file settings to `sm_20` instead of `sm_21`. `compute_21` is invalid (it should be `compute_20`, combined with `sm_21`), and without sitting in front of your VS I'm not sure I can debug why VS is spitting out `compute_21` but there is not much difference between `sm_20` and `sm_21`, and changing that setting should work around this issue. As another test, you might reissue your command line test that produced this output, but change each instance of `compute_21` to `compute_20` (you can leave the instances of `sm_21` alone. I would expect different results – Robert Crovella Feb 06 '14 at 18:59
  • If you want to use recursive functionality, you'll also want to *eliminate* the `compute_10` and `sm_10` settings. You should be able to do this by proper manipulation of the project settings (or possibly the file settings for that particular .cu file) in VS. – Robert Crovella Feb 06 '14 at 19:01
  • I can't believe this... I got rid of 1 error, and 33 new ones came up. Now I'm facing **Error 39 error LNK2001: unresolved external symbol ___cudaRegisterFatBinary@4 c:\Users\mc.choice\documents\visual studio 2010\Projects\ray_cuda\ray_cuda\CUDAraytracer.cu.obj** errors, along with **__cl** and **__glew** errors... – ChoiceMC Feb 06 '14 at 19:21
  • well you're past the compile errors, anyway, and on to the link errors. Honestly, your VS/CUDA install seems messed up. Once again, just giving me your code is not enough to sort this out. It requires understanding a variety of VS project settings, such as which CUDA libraries are being linked in. It looks to me like `cudartXX_42_9.dll` is not being properly linked in. (XX = 32 or 64 depending on host OS). – Robert Crovella Feb 06 '14 at 19:26
  • Would it be easier for you, if I shared my files with you, so you can try some magic? One more question - would it be easier if I would try OpenCL instead of CUDA? – ChoiceMC Feb 06 '14 at 19:30
  • I really don't know much about OpenCL. Sharing your files with me is not likely to help, because the problems lie in how your Visual Studio is configured. If possible, I'd suggest wiping the slate clean and reinstalling VS, followed by a re-install of the CUDA toolkit. Independent of this particular code/issue, an important goal should be to get your CUDA and VS install stable enough so that you can compile and run sample codes without having to restart your machine. – Robert Crovella Feb 06 '14 at 19:34
  • I will try to do that. Tnx anyway, I really appreciate your time & help Robert. I hope I can get this to work till sunday, or else I'm pretty much screwed. :) all the best. – ChoiceMC Feb 06 '14 at 19:36
  • I guess maybe I should point out that a 9500GS is a cc 1.x device, and code compiled for `compute_20`/`sm_20` (or `sm_21`) would not run on it. You would need a cc 2.0 or better device for that (e.g. for the recursive functionality). – Robert Crovella Feb 06 '14 at 19:44
  • Oh... tnx... I'll just go rewrite the code then :) tnx again! – ChoiceMC Feb 06 '14 at 20:09

1 Answer


In this particular case, the error initially described in the question originates from this set of command-line switches being passed to nvcc:

-gencode=arch=compute_21,code=\"sm_21,compute_21\"

compute_21 is not a valid virtual architecture.

Why exactly Visual Studio is generating that invalid switch is not clear. However, the issue can be worked around by changing the project settings to sm_20 in any place where sm_21 shows up. This should not have a significant effect on code generation, and has no effect on the supported capability of the code.
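For reference, a valid form of that switch targeting a cc 2.x device (assuming the rest of the command line is left unchanged) would be:

-gencode=arch=compute_20,code=\"sm_20,compute_20\"

As noted in the comments, the -gencode=arch=compute_10,code=\"sm_10,compute_10\" entry should also be removed if device-side recursion is needed, since recursion requires a device of compute capability 2.0 or higher.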

As discussed in the comments, OP seems to have other issues as well with the Visual Studio configuration.

EDIT: I ran the program you provided in your recent comment. It seems to run "correctly" for me. I ran it under Linux rather than Windows, because that was the machine I had handy to do this type of testing. I didn't make any changes to your program except to change some of the include files to match the Linux pathnames, etc. The main issue I observed is that it takes about 17 seconds per frame to render. If your GPU is much slower, you may have to wait several minutes to see the first frame. Here's sample output:

[screenshot: sample rendered output]

So I would say that the main issue is to improve the rendering speed. I haven't spent a lot of time looking over your program yet, but any kernel called with a <<<1,1>>> configuration is not making effective use of the GPU.
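As an illustration only (the kernel name, image dimensions, and pixel buffer below are hypothetical, not taken from your code), a ray tracer would typically launch one thread per pixel over a 2D grid, something along these lines:

```
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel: each thread shades exactly one pixel.
__global__ void renderKernel(float3 *pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // ... trace the ray for pixel (x, y) here; this placeholder just writes black ...
    pixels[y * width + x] = make_float3(0.0f, 0.0f, 0.0f);
}

// Host-side launch: cover the whole image with a 2D grid of blocks,
// rather than launching a single thread with <<<1,1>>>.
void renderFrame(float3 *d_pixels, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    renderKernel<<<grid, block>>>(d_pixels, width, height);
}
```

The 16x16 block size is just a common starting point; the important part is that the grid covers every pixel, so thousands of threads run in parallel instead of one.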

The GPU I used for this is a Quadro 1000M, which may be significantly faster than your 9500GS.

Robert Crovella
  • Hey. I've stuck with CUDA, and got to a point where everything compiles & runs successfully. I've put everything into the *.cu file. Now the problem is, when I run the program, it shows a black screen only. No bouncing spheres... Other than that, I think it is working, as the measured times are shown in the output (I measure the time of execution for every frame). Any ideas on that? Code can be found here: http://codeshare.io/ZGVAj – ChoiceMC Feb 07 '14 at 15:48
  • Not really. Are you doing error checking on all the CUDA calls and kernel calls? Are you using some form of OpenGL or DX interop? You had a CPU-only version that was working correctly, right? – Robert Crovella Feb 07 '14 at 15:51
  • I don't see any proper [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) in your code. I would definitely start there, by adding it to all cuda API calls and after all kernel calls. You can also get a quick sense of whether there are CUDA issues by running your code with `cuda-memcheck` – Robert Crovella Feb 07 '14 at 15:56
  • Right - the CPU only version worked OK. Hmm, never thought of cuda error checking, tnx for the hint :) Like I said - I'm a beginner in this, especially parallel computing, but would like to gain knowledge about it. Thank god for sites like this. Will report results after work, tnx again. – ChoiceMC Feb 07 '14 at 18:12
  • So I've tried implementing the **gpuErrchk()**. Everything runs fine, as before, but I get no output of any errors, and the output screen is still black. No spheres. I'm running out of ideas. Maybe the real problem really lies in my old graphics card & drivers. The driver really is old, but if I even try installing a newer one, everything goes to smithereens (freezes, crashes at Win boot...), and since the deadline for this task is due sunday, I can't afford the time loss. By the way - what did you mean by **Are you using some from of OpenGL or DX interop?** ? – ChoiceMC Feb 08 '14 at 15:21
  • If you want, and if it's a single file I can compile easily, I'll take a look at it. OpenGL interop is a set of APIs to easily coordinate data exchange between OpenGL and CUDA, without leaving the GPU. OpenGL interop is documented [here](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__OPENGL.html#group__CUDART__OPENGL) and DX interop [here](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__D3D11.html#group__CUDART__D3D11). There are various sample codes for each such as [this one](http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-and-opengl-interop-of-images). – Robert Crovella Feb 08 '14 at 15:44
  • Yes, it is a single file. I've integrated everything into one **.cu** file. If you would try to run it and tell me what your results are, it would be deeply appreciated. Here's the file: https://app.box.com/s/luuatddduruy72vn9vmb . Thank you very much – ChoiceMC Feb 09 '14 at 11:57
  • Updated my answer with sample output from your program. – Robert Crovella Feb 10 '14 at 02:25
  • On my machine, your `disp` loop is taking about 16 seconds. Virtually all of that time is spent in the loop where you are calling `glColor3f` (i.e. moving the cuda rendered pixels into OpenGL). The cuda portion is only taking about 40 milliseconds. Also, you are using `double` types in your code, but devices of compute capability 1.1 do not support `double`. These should be automatically, transparently demoted to float by the compiler, but not sure why you need `double` values for your pixel components. – Robert Crovella Feb 10 '14 at 03:23
  • try using `glDrawPixels()` instead of that loop with `glColor3f`. Your `disp` routine will run much more quickly (converting your `pixel` struct to `float` instead of `double` will facilitate this as well). – Robert Crovella Feb 10 '14 at 04:02
  • Wow... that long? I was getting times of about 0.01s - 0.3s with CUDA, but then again... with no spheres, no drawing. The sequential code needs about 2-3 seconds/frame. Maybe the CUDA time measuring functions aren't where they should be? I will try implementing **glDrawPixels()** instead, and I have changed the pixel struct to float. Will see what happens. But I would need some help implementing the glDrawPixels function... thanks for your help, I really appreciate it. – ChoiceMC Feb 10 '14 at 08:36
  • When I converted to glDrawPixels, I could get around 20 fps from your code. The cuda routine in `disp` is running in about 40 ms on my machine. Can I post the code back for you? – Robert Crovella Feb 10 '14 at 09:29
  • Of course you can, please do. Or you can edit it in the online editor I posted few comment above: http://codeshare.io/ZGVAj. Or by any means you like :) – ChoiceMC Feb 10 '14 at 09:49
  • Take a look at what is posted [here](http://pastebin.com/srg3PaMK). I'm running it under linux, but the only thing I think you should have to change is the include file paths under windows. The output is rotated by 90 degrees, but that is easy to fix. With this code it runs pretty well on my machine and I get around 20 fps. Your machine may be slower, so you may want to adjust the timer milliseconds upward from 50. You may also want to convert all `double` to `float`, but the compiler should do that for cc 1.1. Oh, and the extra timing function I added is for linux, you can comment that. – Robert Crovella Feb 10 '14 at 10:00
  • Wow, it's actually VERY fast with glDrawPixels! Oh, so the last argument is **h_pixels**... I couldn't figure out what to put there. Well, I still get no output, but now I'm pretty sure it's my hardware's fault... I'll just submit the task, do some analysis, and submit the report too. I wish I could meet you, and buy you a beer or something, you helped me A LOT! Thanks again! – ChoiceMC Feb 10 '14 at 11:07
  • I think the no output problem in your case might be due to the fact that one of the kernels (`calc`) is getting compiled in a way that it uses too many registers *and* the cuda error checking implemented on the kernel is not quite right to catch this condition. Take a look at [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) especially the kernel error checking. As a simple fix, try compiling with an additional nvcc command line switch `-maxrregcount 32` and it may start working. – Robert Crovella Mar 17 '14 at 03:02
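For reference, a minimal sketch of the kernel error-checking pattern referenced in the comments above (following the canonical answer linked there; the commented-out kernel launch is a placeholder, not the actual code):

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime API call so that failures are reported
// with the file and line where they occurred.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

// After a kernel launch, check the launch itself and the asynchronous
// execution separately; a launch that fails to start (e.g. because it
// requests too many registers) is caught by cudaPeekAtLastError().
//
//     myKernel<<<grid, block>>>(...);
//     gpuErrchk(cudaPeekAtLastError());
//     gpuErrchk(cudaDeviceSynchronize());
```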