
I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t explain much about how it works. What I think does matter is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated. I use VS2010 with CUDA 5.0, and my device is compute capability 2.1. The original host-only code can be downloaded here → http://files.getwebb.org/view-cre62u4d.html
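(The archive isn’t reproduced here, so just to illustrate what “all three dimensions dynamically allocated” means: the storage is built out of nested pointers, along these lines. d1/d2/d3 and arr are illustrative names, not the ones in the actual code.)

double ***arr = (double ***)malloc( d1 * sizeof(double **) );
for (i = 0; i < d1; i++) {
    arr[i] = (double **)malloc( d2 * sizeof(double *) );    // row of pointers
    for (j = 0; j < d2; j++)
        arr[i][j] = (double *)malloc( d3 * sizeof(double) ); // actual element storage
}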

Main points of the code:

  1. patterns from adult.data are loaded into memory, using the Data structure defined in “pattern.h”.
  2. several multi-dimensional arrays are allocated
  3. the algorithm is run over the patterns, using the arrays allocated just before.

If you want to try running the code, don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue what this algorithm does, let’s simply say that it learns how to predict a target value based on 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.

I spent a really long time trying to run this code on the device, and I’m still unsuccessful. My last attempt was simply to copy the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded here → http://files.getwebb.org/view-cre62u4c.html

To be precise, here are the differences with the original host-only code:

  • f() and fder(), which are used by the algorithm, become device functions (see the sketch after this list).
  • parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
  • the “w” array is initialized with a fixed value (0.5) instead of rand()
  • a Data structure is allocated in device memory, and the data are sent into device memory after being loaded from adult.data into host memory
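Turning a host function into a device function is just a matter of the __device__ qualifier, along these lines (the bodies below are a hypothetical sigmoid pair; the real f() and fder() are in the archive and their signatures may differ):

__device__ double f(double x)    { return 1.0 / (1.0 + exp(-x)); } // hypothetical activation
__device__ double fder(double y) { return y * (1.0 - y); }         // its derivative, given the output y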

I think I did the minimal amount of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded in device memory, proving that the following code, which sends the patterns from the host to the device, did work (a sketch of such a check kernel follows the copy code below):

Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
// Allocating the device-side Data structure and its two arrays.
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating the t and x pointers inside the device's Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
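A check kernel along these lines is enough to confirm the deep copy (a hypothetical sketch; the real kernel_check_learningData is in the archive, and device-side printf requires compute capability 2.0 or higher):

__global__ void kernel_check_learningData(Data* data)
{
    // Print the sizes and the first pattern to confirm the deep copy worked.
    printf("N = %d, n = %d\n", data->N, data->n);
    for (int i = 0; i < data->n; i++)
        printf("x[0][%d] = %f\n", i, data->x[i]);
    printf("t[0] = %d\n", data->t[0]);
}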

It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.

I see two possibilities:

  1. the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
  2. the CUDA API is not behaving like it should!

I’ve been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.

Thanks.

Yugo Amaryl
  • You're not doing kernel [error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) properly. You must use `cudaGetLastError` or similar immediately after the kernel launch to collect all possible errors. It's not enough to simply call cudaDeviceSynchronize() and look at the return value. Why don't you fix that and see if you are getting errors from the kernel launches. If that looks OK, then run your code through cuda-memcheck to find out if you have any bad memory references in your kernels. (A sketch of this pattern follows these comments.) – Robert Crovella Apr 27 '13 at 19:52
  • Could you define exactly what is the "failure" you are seeing? I tried running your code and got the output shown [here](http://pastebin.com/bj6s4Wzc). The MSE is improving (somewhat) over the 10 epochs. What is the failure *exactly*? It runs slowly because your code is using only 1 block of 1 thread - not using the parallel machine at all, but I think you understand that based on your posting. – Robert Crovella Apr 27 '13 at 20:33
  • Thank you for your help. Wow, the code seems to work in your environment! I just get "...failed with error "unknown error"" and no MSE at all! Yes, of course, I don't want to elaborate a parallel version before solving this issue. How do you use cudaGetLastError? Something like this? backpropagation<<<1,1>>>(dev_data); cudaerr = cudaGetLastError(); printf(">>>>> backpropagation kernel error \"%s\".\n", cudaGetErrorString(cudaerr)); I'm reading the documentation about cuda-memcheck... – Yugo Amaryl Apr 27 '13 at 21:41
  • Use the link in my first comment to explain how to check cuda kernel errors. When I run your code, cuda-memcheck reports no errors. Since your kernel takes so long, if you are running it in a windows environment, you may hit a windows TDR event (google that.) Try running just a single iteration of your epoch loop and see if you can get any results that way. Which GPU are you running on, exactly? Is it hosting a windows display also? – Robert Crovella Apr 27 '13 at 21:49
  • When I run your code, with a single iteration of the epoch loop, the backpropagation kernel takes about 2-3 seconds. This may be long enough to trigger a TDR event. So you may want to look at ways to reduce the complexity to shorten the run time, just to prove to yourself that the code is doing something. – Robert Crovella Apr 27 '13 at 21:54
  • Ok, I use the code given in the topic you gave me the link of. I still only get an "unknown error". About the TDR event, the thing is that this error is displayed instantly, and I get no "Display driver stopped..." message from Windows... – Yugo Amaryl Apr 27 '13 at 22:53
  • I've posed several questions that you haven't answered. Can you describe your system setup completely? What gpu are you running on? Is it hosting a windows display? What GPU driver is installed? Do you get the unknown error on the first kernel call (check) or the 2nd (backpropagation)? Can you run any other cuda codes, such as the cuda samples bandwidthTest and deviceQuery, successfully on your setup? – Robert Crovella Apr 27 '13 at 23:30
  • Oh, sorry. No, the check is all OK. It's the backprop that fails. And actually (thanks to a few printfs) it seems the program produces this error even before it finishes the forward phase for the first pattern of the first epoch. Yes, I can run other CUDA codes; I ran several samples. My device is a 610M. I also got the exact same problem with another 2.1 device that I can't give you any detail about right now (I need remote access to a machine that is currently offline). Both machines run Windows 7. Here, with my 610M, the driver version is 306.94. – Yugo Amaryl Apr 28 '13 at 00:05
  • what happens when you run the program with `cuda-memcheck`? Are you running the exact code you have posted in the zip file ? – Robert Crovella Apr 28 '13 at 00:38
  • Yes it's the exact same code. I don't understand how to use cuda-memcheck with Visual Studio. But I have Nsight, which says: Exception = Misaligned Address | PC = 0x00022248 | FunctionRelativePC = 0x00000c48 That's not of much help to me... – Yugo Amaryl Apr 28 '13 at 00:52
  • Maybe you could tell me what your own environment is... – Yugo Amaryl Apr 28 '13 at 13:53
  • I'm running in a linux environment, Quadro5000 (2.5GB Fermi cc 2.0), RHEL 5.5, CUDA 5.0. I actually think that you have narrowed things down quite a bit by discovering that the problem is in the forward path, and also the misalignment indication. I was in the process of trying to set up a win 7 64 bit system with CUDA 5.0 and Quadro1000M (cc 2.1) with visual studio 2008 and nsight VSE (I don't have a paid copy of VS2010), to see if I could more closely match your setup, but I've not finished setting it up yet. – Robert Crovella Apr 28 '13 at 15:31
  • I'm slowly uploading my VS2010 ultimate iso, with its serial, on my google drive if you want it... – Yugo Amaryl Apr 28 '13 at 17:38
  • I got my machine set up. I was able to reproduce your observation. When I switched to building a 64 bit app (have to change project type as well as machine setting in CUDA settings) then I no longer have that issue, and I now get the windows TDR because the kernel is taking too long, even with MAXEPOCHS=1. I'm not sure yet why building it as a 32-bit app causes this problem. My original linux test was 64 bit. – Robert Crovella Apr 30 '13 at 03:18
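For reference, the error-checking pattern recommended in the first comment above looks roughly like this (a sketch based on the linked canonical answer, using the names from the discussion):

backpropagation<<<1, 1>>>(dev_data);
cudaError_t cudaerr = cudaGetLastError();   // catches launch/configuration errors
if (cudaerr != cudaSuccess)
    printf(">>>>> launch failed: \"%s\".\n", cudaGetErrorString(cudaerr));
cudaerr = cudaDeviceSynchronize();          // catches errors raised while the kernel ran
if (cudaerr != cudaSuccess)
    printf(">>>>> backpropagation kernel error \"%s\".\n", cudaGetErrorString(cudaerr));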

1 Answer


Here's the problem in your code, and why it works in 64-bit machine mode but not in 32-bit machine mode.

In your backpropagation kernel, in the forward path, you have a sequence of code like this:

/*
 * for layer = 0
 */
for (i = 0; i < N[0]; i++) {            // for all neurons i of layer 0
    a[0][i] = x[ data->n * pat + i ];   // a[0][i] = input i
}

In 32-bit machine mode (a Win32 project; --machine 32 is being passed to nvcc), the failure occurs on iteration i=7, when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.

By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:

cuda-memcheck test_bp

assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that an out-of-bounds write is occurring, and even identifies the source line it occurs on.

So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:

a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!

a[0][] is intended to hold double data, but you're allocating pointer storage. As it turns out, on a 64-bit machine the two types of storage are the same size, so it ends up working. But on a 32-bit machine a double pointer is 4 bytes whereas a double is 8 bytes, so when we index through this array taking data strides of 8 bytes, we eventually run off the end of the allocation. (With the 14 pattern variables as the layer-0 inputs, N[0] * sizeof(double *) = 14 × 4 = 56 bytes, which holds exactly seven 8-byte doubles; that is why the first out-of-bounds write is a[0][7].)
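You can verify the size mismatch with a trivial host-side check (a hypothetical snippet; build it once as 32-bit and once as 64-bit):

#include <stdio.h>
int main(void)
{
    printf("sizeof(double *) = %u\n", (unsigned)sizeof(double *)); /* 4 on 32-bit, 8 on 64-bit */
    printf("sizeof(double)   = %u\n", (unsigned)sizeof(double));   /* 8 on both */
    return 0;
}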

Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:

a[layer] = (double *)malloc( N[layer] * sizeof(double) );  

which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
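So the fix is simply to make the layer-0 allocation match the others:

a[0] = (double *)malloc( N[0] * sizeof(double) );   // element storage, not pointer storage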

You will still need to address the kernel running time in some fashion, to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.

Robert Crovella
  • Wow! I can't believe I didn't see that mistake earlier! I checked, and this bug actually was in the very original code that was given to me. I think my first mistake was to consider the original code as bug-free. I read it wondering "why does it work" instead of "why does it seem to work". Thank you so much for your time and effort! I was slowly growing depressed! (I can't upvote, unfortunately; I lack reputation) – Yugo Amaryl Apr 30 '13 at 12:10
  • About the TDR event: I already tuned some values in the Windows registry to make this event appear after a much longer time. But it does not really matter, since this non-parallel version was a mere test. I'm now going to code a first parallel version. Thank you again! – Yugo Amaryl Apr 30 '13 at 12:18