
I'm playing with CUDA and trying to run a realistic neuron model on the GPU. It is my second day with CUDA, so I have probably done something completely stupid.

My system:

$ nvidia-smi 
Wed Aug  1 18:03:53 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.45                 Driver Version: 396.45                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 00000000:01:00.0  On |                  N/A |
| 25%   50C    P8    N/A /  N/A |    597MiB /   974MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1235      G   /usr/lib/xorg/Xorg                           232MiB |
|    0      2496      G   /usr/bin/krunner                               1MiB |
|    0      2498      G   /usr/bin/plasmashell                         102MiB |
|    0      2914      G   ...-token=1063E9B61C5D53298A4DC8A65D896440   215MiB |
|    0      4817      G   /usr/bin/kwin_x11                             41MiB |
+-----------------------------------------------------------------------------+

$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  396.45  Thu Jul 12 20:49:29 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 

According to the specifications, I have one SM with 192 cores and a maximum of 1024 threads per block.
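(To double-check those numbers from code rather than from the spec sheet, a quick standalone query of the device properties should do; this is just a sketch, separate from the test program below, and it assumes the K600 is device 0:)

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 assumed to be the K600
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    // Print the SM count and the per-block thread limit the spec sheet promises.
    printf("%s: %d SM(s), max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}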

Now what I want to achieve is to run a simulation of (say) 64 neurons in parallel. Each neuron computes 3 differential equations iteratively, using the Euler method (everything is kept simple at this point); it is just a test. For the performance test, I want to compute 1 minute of model time with a 0.01 ms time step. Here is the code:

#include <stdio.h>
#include <iostream>
#include <math.h>


#define I   7
#define gna 35.
#define gk  9.
#define gl  0.1
#define ena 55.
#define ek  (-90.)
#define el  (-65.)
#define dt  0.01

__global__
void run(float *v, float *h, float *n)
{
    int i = threadIdx.x;
    printf("DB>> i=%d v=%g\n",i,v[i]);
    float minf, ninf, hinf, ntau, htau, a, b;
    for(unsigned long t = 0; t < 6000000l; ++t){       // 60 s of model time at dt = 0.01 ms
    //for(unsigned long t = 0; t < 1000000l; ++t){     // reduced step count that does finish
        // m gate: alpha/beta rates and steady-state activation (m treated as instantaneous)
        a = 0.1*(v[i]+35.)/(1.0-exp(-(v[i]+35.)/10.)) ;
        b = 4.0*exp(-(v[i]+60.)/18.);
        minf = a/(a+b);

        // n gate: steady-state value and time constant
        a = 0.01*(v[i]+34.)/(1.0-exp(-(v[i]+34.)/10.));
        b = 0.125*exp(-(v[i]+44.)/80.);
        ninf =  a/(a+b);
        ntau = 1./(a+b);

        // h gate: steady-state value and time constant
        a = 0.07*exp(-(v[i]+58.)/20.);
        b = 1.0/(1.0+exp(-(v[i]+28.)/10.));
        hinf =  a/(a+b);
        htau = 1./(a+b);

        // forward Euler update of the gating variables and the membrane potential
        n[i] += dt*(ninf - n[i])/ntau;
        h[i] += dt*(hinf - h[i])/htau;
        v[i] += dt*(-gna*minf*minf*minf*h[i]*(v[i]-ena)-gk*n[i]*n[i]*n[i]*n[i]*(v[i]-ek)-gl*(v[i]-el)+I);
        //printf("%g %g\n",dt*t,v);
    }
    printf("DB>> i=%d v=%g\n",i,v[i]);
}

int main(void)
{
  int N = 64;
  float *v, *h, *n;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&v, N*sizeof(float));
  cudaMallocManaged(&h, N*sizeof(float));
  cudaMallocManaged(&n, N*sizeof(float));

    fprintf(stderr,"STEP 1\n");
  // initialize arrays on the host
  for (int i = 0; i < N; i++) {
    v[i] = -63.f;
    h[i] = n[i] = 0.f;
  }

  fprintf(stderr,"STEP 2\n");

  run<<<1, N>>>(v, h, n);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  fprintf(stderr,"STEP 3\n");

  // Free memory
  cudaFree(v);
  cudaFree(h);
  cudaFree(n);

  return 0;
}

This code seems to crash: the second printf in the run function never appears. However, if I reduce the number of steps to 1000000l (see the commented-out line in the run function), it works, prints both the pre- and post-loop printf in run, and shows more or less OK results.
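There is no error checking in the code above; presumably checking the status right after the launch and again after the synchronization would at least show what the runtime reports. A minimal sketch of that pattern (it would replace the plain launch and cudaDeviceSynchronize() in main; it is not in the code above):

  run<<<1, N>>>(v, h, n);

  // Was the launch itself rejected (bad configuration, etc.)?
  cudaError_t launchErr = cudaGetLastError();
  if (launchErr != cudaSuccess)
    fprintf(stderr, "launch error: %s\n", cudaGetErrorString(launchErr));

  // Did the kernel fail while it was running?
  cudaError_t syncErr = cudaDeviceSynchronize();
  if (syncErr != cudaSuccess)
    fprintf(stderr, "kernel error: %s\n", cudaGetErrorString(syncErr));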

Why is that?

  • Your GPU is configured to host a display (`/usr/lib/xorg/Xorg`). Whether it is actually hosting a display or not is irrelevant. When a GPU is hosting a display, and you haven't taken any steps to prevent it, a display watchdog timer is running. If a GPU kernel runs for more than about 2 seconds (the number may vary between windows and linux), the display watchdog will reset the GPU. You should rule this out before proceeding with further debug. I think it's possible that as you reduce the number of steps, your kernel time drops enough to avoid the watchdog. – Robert Crovella Aug 01 '18 at 23:22
  • @RobertCrovella so I should completely shut down graphics, right? – rth Aug 01 '18 at 23:24
  • @RobertCrovella or is it possible to prevent a reset? – rth Aug 01 '18 at 23:25
  • Yes, if you disable the X system, or remove your GPU from X configuration, it should eliminate the watchdog effect. You can confirm this by running the `deviceQuery` app on your GPU and looking at the line that refers to "runtime limit on kernels". You want that to say "No". Right now it says "Yes" Also, [this document](http://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x) may be worth a read, especially if you want to "prevent a reset" while leaving the display active. However that method can lead to instability on long running kernels. – Robert Crovella Aug 01 '18 at 23:25
  • By the way, for a "performance test", a kernel launch of `<<<1,64>>>` is a fairly limiting way to use a GPU, even one as small as your K600. – Robert Crovella Aug 01 '18 at 23:30
  • @RobertCrovella you are completely right! Please post the answer, I'll be happy to accept it. I would highly appreciate any suggestions on performance testing on the GPU. – rth Aug 01 '18 at 23:35
  • Even on your small GPU, you would want to aim for ~2000 threads to keep the GPU busy, so something like `<<<4,512>>>` For a more flexible, scalable code, you would want to aim for 10000+ threads in your kernel launch. – Robert Crovella Aug 01 '18 at 23:45
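Following the deviceQuery suggestion in the comments, the "runtime limit on kernels" flag can also be read programmatically through cudaDeviceProp::kernelExecTimeoutEnabled. A minimal standalone sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // device 0 assumed
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }
    // Non-zero means a display watchdog can kill kernels that run too long --
    // the same information deviceQuery reports as the runtime limit on kernels.
    printf("%s: kernel execution timeout %s\n",
           prop.name, prop.kernelExecTimeoutEnabled ? "ENABLED (watchdog active)" : "disabled");
    return 0;
}

For the launch-configuration suggestion, moving from <<<1, 64>>> to several blocks also means computing the index in the kernel as int i = blockIdx.x * blockDim.x + threadIdx.x; and guarding it with if (i < N).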
