
I am new to CUDA and ran into a problem when writing a singleton/global variable that uses CUDA. The singleton allocates some CUDA memory and tries to free it in its destructor. However, the destructor crashes with cudaError 29, "driver shutting down".

After some searching, I believe the reason might be that the singleton destructor is called after the program exits main, when the CUDA runtime has already shut down.

https://github.com/NVlabs/SASSI/issues/4 reports a similar issue, where a CUDA function is called in the destructor of a static member.

https://devtalk.nvidia.com/default/topic/457922/cudafree-crash-in-destructor-when-exit-is-called/ reports the same question, with an unclear solution.

Honestly, I do not have much CUDA knowledge, so I would like to ask for a detailed explanation and a proper solution to this problem.

EDIT:

Thanks to @Robert Crovella's reminder, I did some tests to reproduce the problem. I found that it happens both for a singleton and for a global variable of type std::unordered_map or std::map whose value objects call CUDA in their destructors.

Working code, where no std::map is used:

#include <iostream>
#include <map>
#include <cuda_runtime.h>

// Prints the cudaError_t value returned by x (0 means cudaSuccess)
#define CUDA_CHECK(x) std::cerr << (x) << std::endl;

class cuda_user
{
    char* data;
public:
    cuda_user() {
        std::cerr << "constr" << std::endl;
        CUDA_CHECK(cudaMalloc((void**)&data, 1024));
    }
    void foo() {
        std::cerr << "foo" << std::endl;
    }
    ~cuda_user() {
        std::cerr << "destr" << std::endl;
        CUDA_CHECK(cudaFree(data));
    }
};

cuda_user cu;  // global object: its destructor runs after main returns
int main()
{
    cu.foo();
}

Outputs:

constr
0
foo
destr
0

Crashing code, with the same cuda_user class but with a std::map:

#include <iostream>
#include <map>
#include <cuda_runtime.h>

// Prints the cudaError_t value returned by x (0 means cudaSuccess)
#define CUDA_CHECK(x) std::cerr << (x) << std::endl;

class cuda_user
{
    char* data;
public:
    cuda_user() {
        std::cerr << "constr" << std::endl;
        CUDA_CHECK(cudaMalloc((void**)&data, 1024));
    }
    void foo() {
        std::cerr << "foo" << std::endl;
    }
    ~cuda_user() {
        std::cerr << "destr" << std::endl;
        CUDA_CHECK(cudaFree(data));
    }
};

std::map<int, cuda_user> map;  // global map: value destructors run after main returns
int main()
{   
    map[1].foo();
}

Outputs:

constr
0
foo
destr
29 << Error!

Update:

I am using gcc 4.8 and nvcc 7.5 on CentOS 6.3.

  • Have your singletons clean up before you leave main: `int main() { { SingletonController singcont; /* .... */ } return 0; }` If you have threads, you should also make sure they end before you leave the inner scope block in main. Then the destructor of the `SingletonController` will have no races with other threads and you have a clean shutdown. Of course, most who do singletons do not do it that way... – BitTickler Mar 05 '16 at 15:07
  • Don't know anything about CUDA, but is there a particular reason why you need any memory to be freed milliseconds before the operating system eliminates your process and releases all memory anyway? I sense an unfounded fear of "memory leaks" here, but perhaps with CUDA things are different. – Christian Hackl Mar 05 '16 at 16:01
  • Singletons are notoriously tricky and there are at least [two canonical implementations that have significant differences](http://stackoverflow.com/questions/13047526/difference-between-singleton-implemention-using-pointer-and-using-static-object). Can you give a simplified example of what you are using? I suspect your exact implementation could affect the viability of the approach suggested by @BitTickler. And if you don't intend to use a "cleanable" singleton design pattern, then the approach suggested by Christian Hackl is probably simplest: eliminate cuda calls from the destructor. – Robert Crovella Mar 05 '16 at 16:14
  • @ChristianHackl The point of making sure your program shuts down without any leaks is to enable you to find the leaks you do not want. We all know that the OS cleans up anything used by the process. But we also all know that we want to produce quality code we can make assertive statements about. – BitTickler Mar 05 '16 at 16:22
  • @BitTickler: So you are saying that the memory allocated by the Singleton prevents one from finding real leaks? – Christian Hackl Mar 05 '16 at 16:26
  • @ChristianHackl Yes, in the same sense as a myriad of warnings in a compile run prevents you from finding the warnings you are interested in. The same logic in another context. Have a 0-warnings policy and a warning showing up will get attention. Have a 0-leaks policy and your test harness can alert you of leaks it finds. – BitTickler Mar 05 '16 at 16:28
  • @BitTickler: I understand your point, but I still think that letting a destructor of an object with static storage duration do anything important is a disadvantage which outweighs the advantage you have described. Having a 0-leaks policy will also prove difficult in C++ when even typical standard-library implementations don't clear up everything, see e.g. http://stackoverflow.com/questions/30376601/valgrind-memory-still-reachable-with-trivial-program-using-iostream – Christian Hackl Mar 05 '16 at 16:34
  • @ChristianHackl Indeed. Having runtime libraries which produce leaks is one of the (many) reasons why I keep looking for a new language for serious embedded programming. And also agreed that RAII is an illusion. As much as constructors should not do failable, non-trivial initialization, destructors come with a set of problems of their own. Over the years, I zeroed in on a two-stage initialization approach. Especially in embedded, it is necessary to have the constructor do no heap operations. What can fail, will fail sooner or later. – BitTickler Mar 05 '16 at 16:37
  • @BitTickler: Well, RAII is bad for static objects, yes. Perhaps one should more often stress the point that the idiom is really meant for objects with automatic storage duration. – Christian Hackl Mar 05 '16 at 16:59
  • @BoLi: Neither of the versions you posted in your edit segfault for me when compiled in VS2012. And that is the nature of undefined behaviour, which is what your problem really is. – talonmies Mar 06 '16 at 09:45
  • @talonmies: I guess by segfault you mean a CUDA error code of 29, do you? I don't understand, and would like to know, which part of my code leads to any undefined behavior. I am compiling with gcc 4.8 and nvcc 7.5 on CentOS 6.3, as updated in the question. – Bo Li Mar 06 '16 at 13:20
  • The placement of CUDA calls in a global object outside of `main` scope will lead to problematic behavior. See [here](http://stackoverflow.com/questions/24869167/cuda-calling-kernel-outside-main-function). Although that description mostly focuses on kernel calls in such a class/object, the hazard applies to any CUDA call, as you have discovered. I would suggest your question is arguably a duplicate of that one, at least insofar as you are requesting a description of what is happening, and now that your example is devoid of any singleton character. – Robert Crovella Mar 06 '16 at 13:31
  • @BoLi: Yes, sorry for the terminology mistake. When you use the CUDA runtime API, the CUDA front end silently emits a lot of boilerplate support code which has the same scope as your translation unit scope map/singleton/object. Whether your code will work comes down to whether the particular sequence of object instantiation and destruction means there is still an active CUDA context present when your classes call their constructor/destructor containing runtime API calls. The order of object instantiation and destruction isn't defined in the language -- this is effectively relying on UB. – talonmies Mar 06 '16 at 13:32
  • To be clear, I should have said "The placement of CUDA calls in *constructors and destructors of* a global object outside of main scope will lead to problematic behavior." Use of CUDA in other class methods may be possible (assuming, e.g., these methods don't get called by constructors/destructors, etc.). – Robert Crovella Mar 06 '16 at 13:46
  • @talonmies thanks for the explanation and your "700th" answer. I might just do a workaround to avoid calling cuda in con/destructor. – Bo Li Mar 06 '16 at 14:00
  • @RobertCrovella thanks for the explanation and the question link. I will remember to avoid this next time. :) – Bo Li Mar 06 '16 at 14:01

1 Answer


[Expanding comments into a summary answer]

Your code is unknowingly relying on undefined behaviour (the order of destruction of translation unit scope objects), and there is no real workaround other than to explicitly control the lifespan of any object that makes CUDA runtime API calls in its destructor, or to avoid those API calls in destructors altogether.

In detail:

The CUDA front end invoked by nvcc silently adds a lot of boilerplate code and translation unit scope objects which perform CUDA context setup and teardown. That code must run before any API call which relies on a CUDA context can be executed, and the context must still exist when such a call is made. If an object whose destructor contains CUDA runtime API calls invokes the API after the context is torn down, the call fails with a runtime error (error 29, "driver shutting down", in your case). C++ does not define the order in which objects with static storage duration in different translation units are constructed and destroyed. Your singleton or map would need to be destroyed before the CUDA context is torn down, but there is no guarantee that will happen. This is effectively undefined behaviour.

You can see a more complete example of what happens (in the context of a kernel launch) in [this answer](http://stackoverflow.com/questions/24869167/cuda-calling-kernel-outside-main-function).

talonmies