
I assume we wish to do error handling such as that described in this answer, where we test the return code and throw an exception if it does not indicate success.
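For concreteness, here is a minimal sketch of that pattern (the CUDA_CHECK name is just illustrative, not an established API):

    #include <cuda_runtime.h>
    #include <stdexcept>
    #include <string>

    // Check the return code of a runtime API call; throw on any failure.
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err_ = (call);                                  \
            if (err_ != cudaSuccess) {                                  \
                throw std::runtime_error(std::string("CUDA error: ") +  \
                                         cudaGetErrorString(err_));     \
            }                                                           \
        } while (0)

    // Usage:
    //   CUDA_CHECK(cudaEventCreate(&ev));
    //   CUDA_CHECK(cudaEventDestroy(ev));  // may surface an earlier async error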

Now, suppose cudaEventDestroy returns an error from a previous, asynchronous launch, as the documentation notes it may do.

In this situation, has the event been successfully destroyed? More generally, can I expect any runtime functions to have successfully completed their function if they return an error from a previous, asynchronous launch?

What can I do if this happens at a place in my code where it is inconvenient to actually handle the error, such as a destructor?

It seems that, if I don't want my programs to either randomly terminate or lose track of errors, I would have to implement a duplicate error recording system: record errors that occur in places that can't really handle them, and change the boilerplate for making runtime API calls to check both the return status and that duplicate record. This seems rather awkward and suboptimal, and I'm hoping I'm missing something simple.
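To illustrate, here is a rough sketch of what such a duplicate recording system might look like (all names are my own, purely illustrative):

    #include <cuda_runtime.h>
    #include <atomic>

    // Process-wide side channel for errors raised where throwing is impossible.
    std::atomic<cudaError_t> g_deferredError{cudaSuccess};

    inline void recordIfError(cudaError_t err) {
        cudaError_t expected = cudaSuccess;
        if (err != cudaSuccess)
            g_deferredError.compare_exchange_strong(expected, err);  // keep first error
    }

    struct EventGuard {
        cudaEvent_t ev{};
        ~EventGuard() {
            // A destructor must not throw, so record the status instead.
            recordIfError(cudaEventDestroy(ev));
        }
    };

    // The boilerplate around ordinary API calls would then check both the
    // immediate return status and g_deferredError.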

  • What error code are you actually asking about? Is this a *gedankenexperiment*, or do you have a concrete example? – talonmies Feb 02 '15 at 06:46

1 Answer


More generally, can I expect any runtime functions to have successfully completed their function if they return an error from a previous, asynchronous launch?

In general, no. Many types of errors resulting from previous, asynchronous launches are of a type that invalidates the CUDA context. Once a CUDA context has been invalidated, no further operations of any kind are possible with it, except to destroy it. In light of this, your question about the status of the hypothetical cudaEvent is moot.

What can I do if this happens at a place in my code where it is inconvenient to actually handle the error, such as a destructor?

Many types of CUDA errors are persistent, especially those that reflect an invalidated CUDA context[1]. Those types of errors cannot be cleared and will re-present themselves on any subsequent error-checking activity. You may therefore be able to achieve a suitable level of error control by checking comprehensively everywhere except in those places where it is inconvenient to do so. If your concern about error-checking during destructor activity is specifically during application tear-down, it's not clear that any of this is an issue.

[1]: For example:

"cudaErrorIllegalAddress = 77 The device encountered a load or store instruction on an invalid memory address. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA."

Additional note:

To take the example of errors returned by an asynchronous launch: the errors that do not invalidate a CUDA context (such as an invalid launch configuration) are reported immediately and would be trapped by proper CUDA error checking on the kernel launch, and I would not expect these to be of a type that could only show up later, possibly during a destructor operation. Most of the errors that occur some time after kernel execution begins are of a type that would invalidate a CUDA context; those are persistent and cannot be cleared.
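A sketch of that distinction, again assuming standard runtime behavior:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void kernel() {}

    int main() {
        kernel<<<1, 2048>>>();                      // exceeds 1024 threads/block
        cudaError_t launchErr = cudaGetLastError(); // trapped immediately, at launch
        printf("launch: %s\n", cudaGetErrorString(launchErr));
        // launchErr is cudaErrorInvalidConfiguration; the context remains valid,
        // and this error does not persist once it has been retrieved.
        cudaError_t syncErr = cudaDeviceSynchronize();
        printf("sync:   %s\n", cudaGetErrorString(syncErr));
        return 0;
    }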

Robert Crovella
  • This concern came up when I was in the process of writing some library code, and code that might be used in long-running applications at that, so I can't actually just ignore errors that might leak resources. In addition to the explicit content, I think your posting also says that the resources associated with an event would be cleaned up when the context is destroyed, so I think this fully answers my concerns. –  Feb 02 '15 at 21:03
  • Yes, all resources associated with a context are cleaned up when a context is destroyed. – Robert Crovella Feb 02 '15 at 21:06