I am trying to compile and run the following code on an NVIDIA P100. I'm running CentOS 6.9 with driver version 396.37 and CUDA 9.2. As far as I can tell, these driver and CUDA versions are compatible.

#include <stdio.h>
#include <cuda_runtime_api.h>
int main(int argc, char *argv[])
{
    // Declare variables
    int * dimA = NULL; //{2,3};
    cudaMallocManaged(&dimA, 2 * sizeof(int));
    dimA[0] = 2;
    dimA[1] = 3;
    cudaDeviceSynchronize();
    printf("The End\n");

    return 0;
}

It fails with a segmentation fault. When I compile with nvcc -g -G src/get_p100_to_work.cu and load the resulting core file in the debugger (cuda-gdb ./a.out core.277512), I get

Reading symbols from ./a.out...done.
[New LWP 277512]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./a.out'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040317d in main (argc=1, argv=0x7fff585da548) at src/get_p100_to_work.cu:71
71      dimA[0] = 2;
(cuda-gdb) bt full
#0  0x000000000040317d in main (argc=1, argv=0x7fff585da548) at src/get_p100_to_work.cu:71
        dimA = 0x0
(cuda-gdb)

When I run this code on an NVIDIA K40, it runs without error.

QUESTION :

How do I get my code to run on the P100? According to this tutorial, this code should run.

irritable_phd_syndrome
  • That code works for me with the same driver and CUDA toolkit version on an Ubuntu box. Can you run anything successfully on the P100? – talonmies Oct 01 '18 at 15:21
  • I can run https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/ . On the P100 it reports `Max error: 2.000000`. On the K40 it reports `Max error: 0.000000`. – irritable_phd_syndrome Oct 01 '18 at 16:42
  • So in other words, your CUDA installation is broken or your card is broken. I see no programming question here; you might want to try asking somewhere else. – talonmies Oct 01 '18 at 16:46
  • Any time you are having trouble with a CUDA code, it's good practice to do proper CUDA error checking. I usually recommend that before asking others for help. In this case, the error checking would likely indicate that something is wrong with the CUDA install. – Robert Crovella Oct 01 '18 at 19:29
  • Actually, I was trying to do the proper error checking described at https://stackoverflow.com/a/14038590/4021436, but I was still getting non-descriptive segmentation faults. I stripped it out because it was failing at the `gpuAssert()`. – irritable_phd_syndrome Oct 01 '18 at 21:55
  • Seg faults in managed memory usage often occur because the cudaMallocManaged call fails and returns an error. If you don't do any error checking and just proceed with writing to the area you expected to be allocated, you often get a seg fault. It's impossible to diagnose your situation exactly, but I wouldn't strip out error checking. If the error checking is failing at the `gpuAssert()` statement, that is a pretty good indication that there is a fundamental problem. Leaving that information out of your question doesn't help anyone to diagnose the issue. – Robert Crovella Oct 02 '18 at 04:17
  • I ended up having to reinstall the driver and now it works. – irritable_phd_syndrome Oct 02 '18 at 17:01
  • @irritable_phd_syndrom: Make that an answer and accept it please. – einpoklum Oct 03 '18 at 06:30
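
The error checking recommended in the comments above could look roughly like the following. This is a minimal sketch, not the code from the linked answer: the `CUDA_CHECK` macro is a hypothetical helper introduced here for illustration, and on a badly broken driver install the failure may still surface differently than shown.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): print the CUDA error
// string with file and line, then exit, whenever a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void)
{
    int *dimA = NULL;

    // If the allocation fails, report the reason instead of going on
    // to dereference a NULL pointer on the host.
    CUDA_CHECK(cudaMallocManaged(&dimA, 2 * sizeof(int)));

    dimA[0] = 2;
    dimA[1] = 3;

    CUDA_CHECK(cudaDeviceSynchronize());
    printf("dimA = {%d, %d}\n", dimA[0], dimA[1]);

    CUDA_CHECK(cudaFree(dimA));
    return 0;
}
```

With a check like this in place, a failed `cudaMallocManaged` would normally produce an error message naming the failing call rather than a bare segmentation fault at the first write to `dimA`.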

1 Answer

Previously, I had cloned an image of a GPU node containing two K40s and deployed that image on a node containing two P100s. I suspect that installing the driver on the K40 node creates configuration specific to the graphics cards in that machine (which makes sense), and that this configuration was not compatible with the P100. Since the driver on the P100 machine was effectively corrupted, this would explain why my code failed so catastrophically.

Solution : I ended up having to reinstall the driver and now it works.
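
For anyone in a similar situation, a quick sanity check of the installation might look like the sketch below. It only uses standard CUDA runtime queries and reports the status of each call; it is an illustration of how one could probe a suspect install, not something I ran at the time, and what a corrupted driver actually reports here is not something I can vouch for.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int driverVersion = 0, runtimeVersion = 0, deviceCount = 0;
    cudaError_t err;

    // Version of CUDA supported by the installed driver.
    err = cudaDriverGetVersion(&driverVersion);
    printf("cudaDriverGetVersion:  %s (value %d)\n",
           cudaGetErrorString(err), driverVersion);

    // Version of the CUDA runtime the program was built against.
    err = cudaRuntimeGetVersion(&runtimeVersion);
    printf("cudaRuntimeGetVersion: %s (value %d)\n",
           cudaGetErrorString(err), runtimeVersion);

    // Number of devices the runtime can see; stays 0 if the call fails.
    err = cudaGetDeviceCount(&deviceCount);
    printf("cudaGetDeviceCount:    %s (value %d)\n",
           cudaGetErrorString(err), deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, i);
        if (err != cudaSuccess) {
            printf("device %d: %s\n", i, cudaGetErrorString(err));
            continue;
        }
        // managedMemory is 1 if the device supports cudaMallocManaged.
        printf("device %d: %s, compute capability %d.%d, managedMemory=%d\n",
               i, prop.name, prop.major, prop.minor, prop.managedMemory);
    }
    return 0;
}
```

If any of these calls returns an error, or the listed devices do not match the hardware actually in the node, that points at the installation rather than at the program.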

irritable_phd_syndrome