
I am debugging an MPI-based CUDA program with DDT. My code aborts when the CUDA runtime library (libcudart) throws an exception in the (undocumented) function cudaGetExportTable, when it is called from cudaMalloc and cudaThreadSynchronize (UPDATE: using cudaDeviceSynchronize gives the same error) in my code.

Why is libcudart throwing a C++ exception (I am using the C API, not the C++ API) before I can detect the error in my code via the cudaError_t return value or my CHECKCUDAERROR macro?

(I'm using CUDA 4.2 SDK for Linux.)

Output:

Process 9: terminate called after throwing an instance of 'cudaError_enum'
Process 9: terminate called recursively

Process 20: terminate called after throwing an instance of 'cudaError'
Process 20: terminate called recursively

My code:

cudaThreadSynchronize();
CHECKCUDAERROR("cudaThreadSynchronize()");

Other code fragment:

// inside Malloc_cuda_gx(); t is the allocation size passed in by the caller
void* p = NULL;
const cudaError_t r = cudaMalloc(&p, t);
if (r != cudaSuccess) {
    ERROR("cudaMalloc failed.");
}

Partial Backtrace:

Process 9:
cudaDeviceSynchronize()
-> cudaGetExportTable()
   -> __cxa_throw

Process 20:
cudaMalloc()
-> cudaGetExportTable()
   -> cudaGetExportTable()
      -> __cxa_throw

Memory debugging errors:

Processes 0,2,4,6-9,15-17,20-21:
Memory error detected in Malloc_cuda_gx (cudamalloc.cu:35):
dmalloc bad admin structure list. 

This line is the cudaMalloc code fragment shown above. Also:

Processes 1,3,5,10-11,13-14,18-19,23:
Memory error detected in vfprintf from /lib64/libc.so.6:
dmalloc bad admin structure list.

Also, when running with 3 cores/GPUs per node instead of 4, dmalloc detects similar memory errors; yet outside of debug mode, the code runs perfectly fine with 3 GPUs per node (as far as I can tell).

BenWibking
  • How many MPI processes are you running when this happens? – talonmies Jul 24 '12 at 17:39
  • I'm running with 24 processes across 6 nodes (4 cores and gpus per node). – BenWibking Jul 24 '12 at 18:14
  • `cudaThreadSynchronize` is technically deprecated. What happens if you try `cudaDeviceSynchronize` instead? – talonmies Jul 24 '12 at 18:29
  • Using `cudaDeviceSynchronize` gives the exact same error. – BenWibking Jul 24 '12 at 19:01
  • Ok. `cudaGetExportTable` is an internal context symbol management routine which should never fail the way you are seeing. So either you have found a bug (perhaps try the cuda 5 beta and newer version of the runtime and driver), or you code has corrupted something with a buffer overrun. – talonmies Jul 25 '12 at 07:41
  • That's good to know. Since I'm running this on a shared cluster, I'd prefer to eliminate the latter possibility first. Would running with `cuda-memcheck` reliably detect a buffer overrun in this case? – BenWibking Jul 25 '12 at 15:03
  • @talonmies: I do not believe there are *any* behavior differences between cudaThreadSynchronize() and cudaDeviceSynchronize(). They just changed the name to be more descriptive. – ArchaeaSoftware Jul 25 '12 at 15:11
  • @BenWibking: I suspect a driver mismatch. Are you sure all the machines in the cluster are running a driver suitable for the CUDA runtime used by your application? – ArchaeaSoftware Jul 25 '12 at 15:12
  • I think the driver is correct. `nvcc` reports: `Cuda compilation tools, release 4.2, V0.2.1221` and `modinfo nvidia` reports `version: 295.41` on all of the nodes. – BenWibking Jul 25 '12 at 16:03
  • Memory debugging reveals: `Processes 0,2,4,6-9,15-17,20-21: Memory error detected in Malloc_cuda_gx (cudamalloc.cu:35): dmalloc bad admin structure list`. This line is the `cudaMalloc` code fragment shown above. Also: `Processes 1,3,5,10-11,13-14,18-19,23: Memory error detected in vfprintf from /lib64/libc.so.6: dmalloc bad admin structure list`. Any suggestions? – BenWibking Jul 25 '12 at 19:16

1 Answer


Recompile with gcc. (I was using icc to compile my code.)

With gcc, the exception still appears when debugging, but continuing past it, I get real CUDA errors:

Process 9: gadget_cuda_gx.cu:116: ERROR in gadget_cuda_gx.cu:919: CUDA ERROR:   cudaThreadSynchronize(): unspecified launch failure
Process 20: cudamalloc.cu:38: ERROR all CUDA-capable devices are busy or unavailable, cudaMalloc failed to allocate 856792 bytes = 0.817101 Mb

Valgrind reveals no memory corruption or leaks in my code (either compiling with gcc or icc), but does find a few leaks in libcudart.

UPDATE: Still not fixed. Appears to be the same problem reported in answer #2 to this thread: cudaMemset fails on __device__ variable. The runtime isn't working like it should, it seems...
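(For reference, the failing pattern discussed in that thread is calling `cudaMemset` on a `__device__` variable; the workaround is to resolve the symbol's device address first with `cudaGetSymbolAddress`. A sketch, not my actual code — the symbol-argument form shown is the CUDA 5+ signature:)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ float devData;

int main()
{
    // Host code must not pass &devData to cudaMemset directly; the
    // device address of a __device__ symbol has to be resolved first.
    float *addr = NULL;
    cudaError_t r = cudaGetSymbolAddress((void **)&addr, devData);
    if (r == cudaSuccess)
        r = cudaMemset(addr, 0, sizeof(float));
    printf("%s\n", cudaGetErrorString(r));
    return r == cudaSuccess ? 0 : 1;
}
```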

BenWibking
  • That was a pretty huge omission - icc isn't supported! The unspecified launch failure probably means out of bounds memory access in your kernel somewhere. You can use cuda-memcheck standalone or inside cuda-gdb to get more diagnostics. – talonmies Jul 28 '12 at 07:10
  • Ah, good to know. I discovered that the code that has been passed down to me uses variable scoping semantics incorrectly. However, I cannot find an idiomatic way to reproduce the correct behavior. The relevant device function declares and returns a `dim3 result;` which obviously goes out of scope upon return. However, the `result` variable should be thread-local (it stores the thread, block info) and modifiable, which does not appear to be possible in CUDA...? – BenWibking Jul 30 '12 at 22:12
  • Ok, so it turns out neither of these was the problem. (I upgraded to CUDA 5 and now the error reporting is working better.) I can reproduce the error I'm getting with `cudaGetSymbolAddress` from some code that you posted on another thread: http://stackoverflow.com/questions/9996563/cudamemset-fails-on-device-variable. So it seems that something is broken... – BenWibking Aug 30 '12 at 02:13