
I am trying to learn CUDA by writing basic code, which should hopefully put me in a better position to convert my existing C++ code to CUDA (for research).

I need to do a fair bit of complex number manipulation, so I have written this very basic code to multiply an array of complex numbers by a real number in a GPU kernel.

#include <complex>
#include <iostream>
#include <cmath>
#include "cuda.h"
#include "math.h"
#include "cuComplex.h"

#define n   5

using namespace std;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess) 
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void func( double *s, cuDoubleComplex *j, cuDoubleComplex *calc ) {

    int tid = blockIdx.x;

    calc[tid] = cuCmul(j[tid], make_cuDoubleComplex(*s, 0));

}

int main( void ) {


    cuDoubleComplex calc[n+1], *dev_j, *dev_calc;
    double *dev_s, s[n+1] = { 2.0, 2.0, 2.0, 2.0, 2.0 };
    //complex<double> j[n+1]
    cuDoubleComplex j[n+1];

    for (int i = 1; i <= n; i++) {
        j[i] = make_cuDoubleComplex(0, 5);
        cout << "\nJ cout = " << cuCreal(j[i]) << ", " << cuCimag(j[i]);
    }

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_s, (n+1) * sizeof(double) );
    cudaMalloc( (void**)&dev_j, (n+1) * sizeof(double) );
    cudaMalloc( (void**)&dev_calc, (n+1) * sizeof(double) );

    cudaMemcpy( dev_s, s, (n+1) * sizeof(double), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_j, j, (n+1) * sizeof(double), cudaMemcpyHostToDevice );

    func<<<n,1>>>( dev_s, dev_j, dev_calc );
    //kernel<<<1,1>>>(a_d);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaMemcpy(calc, dev_calc, (n+1) * sizeof(double), cudaMemcpyDeviceToHost) );

    //cudaMemcpy( calc, dev_calc, (n+1) * sizeof(double), cudaMemcpyDeviceToHost );

    for (int i = 1; i <= n; i++) {
        cout << "\nCALC cout = " << cuCreal(calc[i]) << ", " << cuCimag(calc[i]);
    }

    return 0;
}

The final answer is wrong, and I have also identified a few other places where I am not getting expected values.

1) I expected a complex double array of (0, 5i) for all elements of 'j' after the following line of code. However, I am getting all 0s. Why is that?

j[i] = make_cuDoubleComplex(0, 5); 

2) Why can't I print my array using cout? The line of code shown below gives the following error: no operator "<<" matches these operands. How can I fix this without using printf?

cout << "\nJ = " << j[i];

3) The GPU function 'func' which should give out an array of (0, 10i) as the final answer is giving random values such as these:

CALC = -1.#QNAN0
CALC = -1.#QNAN0
CALC = -9255963134931783100000000...000.. etc
CALC = -9255963134931783100000000...000.. etc

4) For my actual research, the complex array 'j' will be given in the format of complex<double> and not cuDoubleComplex. Can I do similar operations to a 'j' array of complex<double>s using the function 'func'? If not, what are my options?

I think I've explained myself well, but feel free to ask any follow-up questions. New to C++ as well as CUDA so be nice :D

user2550888
  • What happened when you tried to multiply a *single* complex number by a real number in a GPU kernel? – Beta Jul 04 '13 at 19:24

1 Answer

When writing CUDA code, especially when you're learning or things aren't working the way you expect, you should always do proper CUDA error checking on all CUDA API calls and kernel launches.

I don't think there are actually any CUDA functional errors in your code (good job!) but it's worth pointing out.

Most of your questions are due to the fact that you're not printing out the type cuDoubleComplex properly. Your printf statements are specifying a float format parameter (%f) but you are not passing a float value (you're passing a cuDoubleComplex value). That won't work, and printf will behave strangely when you do that, without giving any error indication.

Instead, try something like this:

printf("\nJ = %f, %f", cuCreal(j[i]), cuCimag(j[i])); 

These functions (cuCreal and cuCimag) return the real and imaginary parts of cuComplex numbers, as the appropriate type, float or double. Note that %f is a correct printf format specifier for both cases: variadic arguments promote float to double, so printf only ever sees a double for %f (this is unlike scanf, where %f and %lf are distinct).

If you make that change for both of your printf statements, I think you'll get expected results -- at least I did when I ran your code. If you still get garbage, then your CUDA GPU may not be working correctly, and here is where doing that CUDA error checking I mentioned will help you discover what the problem is.

Regarding your questions concerning cout, the answer is roughly equivalent to my explanation for what is going on with printf. cout doesn't understand the type cuDoubleComplex and so throws an error. If you want to fix it without using printf, convert your cuDoubleComplex to its individual real and imaginary parts, represented by float or double, using the conversion functions I've indicated in the printf statement above.

Regarding your last question, it should not be difficult to convert your complex data to the cuDoubleComplex type. Write a conversion function to do it based on the utilities you have in cuComplex.h. There are back-door ways around this, but they're not good programming practice.
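As a sketch of such a conversion function (again using a hypothetical stand-in struct in place of the real cuDoubleComplex from cuComplex.h), an explicit element-wise copy makes no layout assumptions at all:

```cpp
#include <complex>
#include <vector>

// Stand-in for cuDoubleComplex (hypothetical; real code would
// #include "cuComplex.h" and drop this definition).
struct cuDoubleComplex { double x, y; };

// std::complex<double> -> cuDoubleComplex, one element at a time.
inline cuDoubleComplex to_cu(const std::complex<double>& z) {
    cuDoubleComplex c; c.x = z.real(); c.y = z.imag(); return c;
}

inline std::complex<double> from_cu(const cuDoubleComplex& c) {
    return std::complex<double>(c.x, c.y);
}

// Convert a whole host array before (or after) the cudaMemcpy calls.
inline std::vector<cuDoubleComplex> to_cu(const std::vector<std::complex<double> >& v) {
    std::vector<cuDoubleComplex> out;
    out.reserve(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) out.push_back(to_cu(v[i]));
    return out;
}
```

In practice both types store the real part followed by the imaginary part as two doubles (see njuffa's comment below), so you may also be able to pass a std::complex<double> array straight to cudaMemcpy; the explicit copy above is just the assumption-free version.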

EDIT: In response to follow up questions, there were two more errors in the code currently posted.

  1. dev_j and dev_calc are of type cuDoubleComplex *, but you were doing cudaMalloc and cudaMemcpy on these quantities as if their elements were of type double. In the following code, I changed those sizeof(double) entries to sizeof(cuDoubleComplex).
  2. Your indexing in general was a little odd for C and C++. Usually indices start at zero. You had an indexing problem where the last element was not getting computed properly. I changed all indexing to be zero-based.
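The first mistake can also be ruled out mechanically by sizing allocations from the pointer itself (sizeof(ptr[0])) instead of spelling out the type name, as njuffa suggests in the comments. A minimal sketch of why this is safe, with a hypothetical stand-in struct for cuDoubleComplex:

```cpp
#include <cstddef>

// Stand-in for cuDoubleComplex (hypothetical; real code uses cuComplex.h).
struct cuDoubleComplex { double x, y; };

// sizeof(p[0]) is computed at compile time and never dereferences p,
// so it is safe even on a null or not-yet-allocated pointer, and it
// automatically tracks the pointee type if the declaration ever changes.
inline std::size_t bytes_for(const cuDoubleComplex* p, std::size_t count) {
    (void)p;                       // only the static type of p is used
    return count * sizeof(p[0]);   // count * sizeof(cuDoubleComplex)
}
```

So the allocation becomes cudaMalloc((void**)&dev_calc, n * sizeof(dev_calc[0])), and the same expression goes into the matching cudaMemcpy calls, keeping the byte counts self-consistent.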

Here's a modification of your code that works for me:

//#include <complex>  // not necessary for this code
#include <iostream>
#include <cmath>
//#include "cuda.h"  // not necessary when compiling with nvcc
#include "math.h"
#include "cuComplex.h"

#define n   5

using namespace std;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void func( double *s, cuDoubleComplex *j, cuDoubleComplex *calc ) {

    int tid = blockIdx.x;

    calc[tid] = cuCmul(j[tid], make_cuDoubleComplex(*s, 0));

}

int main( void ) {


    cuDoubleComplex calc[n], *dev_j, *dev_calc;
    double *dev_s, s[n] = { 2.0, 2.0, 2.0, 2.0, 2.0 };
    //complex<double> j[n+1]
    cuDoubleComplex j[n];

    for (int i = 0; i < n; i++) {
        j[i] = make_cuDoubleComplex(0, 5);
        cout << "\nJ cout = " << cuCreal(j[i]) << ", " << cuCimag(j[i]);
    }

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_s, (n) * sizeof(double) );
    cudaMalloc( (void**)&dev_j, (n) * sizeof(cuDoubleComplex) );
    cudaMalloc( (void**)&dev_calc, (n) * sizeof(cuDoubleComplex) );

    cudaMemcpy( dev_s, s, (n) * sizeof(double), cudaMemcpyHostToDevice );
    cudaMemcpy( dev_j, j, (n) * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice );

    func<<<n,1>>>( dev_s, dev_j, dev_calc );
    //kernel<<<1,1>>>(a_d);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaMemcpy(calc, dev_calc, (n) * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost) );

    //cudaMemcpy( calc, dev_calc, (n+1) * sizeof(double), cudaMemcpyDeviceToHost );

    for (int i = 0; i < n; i++) {
        cout << "\nCALC cout = " << cuCreal(calc[i]) << ", " << cuCimag(calc[i]);
    }

    return 0;
}
Robert Crovella
  • The data layout of the complex types in cuComplex.h is compatible with the data layout used for built-in complex types in C/C++/Fortran codes on the host, so no explicit conversion is necessary. Basically the complex types are structs where the real portion is followed by the imaginary portion, and both are IEEE-754 single-precision or double-precision numbers. Other software environments may store complex data in SOA-like arrangements where real and imaginary components are separated into different arrays; in those cases reshuffling of the data will be required. – njuffa Jul 04 '13 at 20:10
  • I changed the printf to: cout << "\nCALC cout = " << cuCreal(calc[i]) << ", " << cuCimag(calc[i]); This gives the correct answer for 'j' but not for 'calc'.. I also included the error checking code and that isn't returning any errors either (not that I am aware of) – user2550888 Jul 04 '13 at 20:21
  • It's likely there is something wrong with your GPU setup. Are you able to run other GPU codes? Edit your original question with the new code you have showing the error checking you put in, and I will take a look. Like I said, with only the changes to the `printf` statements, I was able to get your code working. – Robert Crovella Jul 04 '13 at 20:23
  • I've updated the code now. My CALC cout values are: -5.69507e+303, -5.68301e+303 -5.69507e+303, -5.68301e+303 -9.25596e+061, -9.25596e+061 -9.25596e+061, -9.25596e+061 -9.25596e+061, -9.25596e+061 Yup, I've managed to run the CUDA sample codes as well as the codes from the CUDA by Example book successfully – user2550888 Jul 04 '13 at 20:37
  • 1
    This looks suspicious: `cudaMalloc( (void**)&dev_calc, (n+1) * sizeof(double) );`. Try `sizeof(dev_calc[0])` instead, that way the size computation is always self-consistent, not matter what type `dev_cal` has. Same bug is also in the calls to `cudaMemcpy()`. – njuffa Jul 04 '13 at 20:43
  • @RobertCrovella This is really strange because I'm still getting meaningless values for the final result. No matter what I do inside the global function the calc results don't seem to change either. Am I printing out the correct thing? cudaMemcpy(calc, dev_calc, (n) * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost); for (int i = 0; i < n; i++) { cout << "\nCALC cout = " << cuCreal(calc[i]) << ", " << cuCimag(calc[i]); } – user2550888 Jul 05 '13 at 13:12
  • @njuffa I'm afraid changing the sizeof(double) to sizeof(dev_calc[0]) didn't seem to have an effect either. – user2550888 Jul 05 '13 at 13:15
  • are you actually running the code I have now posted in my answer? – Robert Crovella Jul 05 '13 at 13:29
  • I am, which is why its really strange.. I changed the global function so that it would just copy the 'j' array into 'calc' (calc[tid] = j[tid]) but this gives the same results (-5.69507e+303, -5.68301e+303 etc) – user2550888 Jul 05 '13 at 13:37
  • run your application with cuda-memcheck. I assume you are on linux. something like `cuda-memcheck ./myapp` Report back what it indicates, whether any errors or not. You haven't really done the cuda error checking correctly in your app (and I didn't fix it) so this will help us discover if there is an issue. – Robert Crovella Jul 05 '13 at 13:41
  • I'm on XP, VS 2010, CUDA 5.0 and a C1060 card. Is it cuda-memcheck["programme name"] in command prompt after going to the correct folder? I am getting "nothing to check" or "not a recognisable command" – user2550888 Jul 05 '13 at 13:57
  • What is the compile command line that visual studio is using to compile your program? what happens if you type `cuda-memcheck --version` at a command prompt? – Robert Crovella Jul 05 '13 at 14:25
  • memcheck --version gives: version 5.0 (17). cuda-memcheck output print screen - http://postimg.org/image/vkspynddt/ I'm not sure how to run my existing code with memcheck – user2550888 Jul 05 '13 at 15:28
  • What is the name of the executable you have built with visual studio? If the name of your executable is `myapp.exe` what happens when you type `cuda-memcheck myapp.exe` ? – Robert Crovella Jul 05 '13 at 15:35
  • VS isn't creating an executable :S There is a .cu file ("p_basic.cu") in the project folder, when I tried running cuda-memcheck p_basic.exe (even though it isn't visible in the folder) and cuda-memcheck p_basic.cu it says "Error: Could not run p_basic" – user2550888 Jul 05 '13 at 15:49
  • If you are running a program, VS is creating an executable. I guarantee it. You just haven't located it. If you inspect the output window (at the bottom of the display) in visual studio closely, you will see the name of the executable file generated as well as the compile command that I asked for earlier. If you don't understand any of this, you probably need to learn more about Visual Studio. – Robert Crovella Jul 05 '13 at 15:53
  • I was looking for a .exe with the solution instead of project name, sorry! The memcheck results - http://postimg.org/image/auqry5bml/ – user2550888 Jul 05 '13 at 16:01
  • Can you give me the compile command line? It should have `nvcc` somewhere in it. – Robert Crovella Jul 05 '13 at 16:07
  • Is this it? 1> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\nvcc.exe" -gencode=arch=compute_10,code=\"sm_10,compute_10\" --use-local-env --cl-version 2010 -ccbin "c:\Program Files\Microsoft Visual Studio 10.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -G --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -g -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o "Debug\p_basic.cu.obj" – user2550888 Jul 05 '13 at 16:09
  • Yes, that is it. Your project is set up to compile for a `sm_10` device, which is OK, except when you are using `double` values. `sm_10` devices don't support `double`. The way to fix it is to tell VS to compile your project for an `sm_13` device, which is what your C1060 is. `sm_13` supports `double` Even with `sm_10` I still expect the code to run correctly, but I'm not sure about that with the presence of the `--machine 32` switch. Basically 32bit windows XP and C1060 are very old, so my knowledge is fuzzy. Can you switch the project to use `sm_13`, rebuild, and re-test ? – Robert Crovella Jul 05 '13 at 16:27
  • I have a Quadro 600 as well, maybe sm_10 is for that? Can you tell me how to switch to sm_13 please? – user2550888 Jul 05 '13 at 16:32
  • Can you start to research some of these questions for yourself? Try [this link](http://stackoverflow.com/questions/14411435/how-to-set-cuda-compiler-flags-in-visual-studio-2010/14413360#14413360) -- use the suggestion in the answer, not in the question (which doesn't work). – Robert Crovella Jul 05 '13 at 17:20
  • I switched to a different machine and the code worked on it. Will try switching to sm_13 on Monday as I need to get it to work on the other machine as well. Thanks for the comprehensive answers again! – user2550888 Jul 05 '13 at 17:55
  • Tested and worked. compute_13 and sm_13 for Tesla GPUs - http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation – user2550888 Jul 08 '13 at 10:31