
I'm updating a Windows and macOS C++ code base for a client of mine to use CUDA 10.1 instead of an older version (8? something older). It has a big internal error code list with two sections for recognized CUDA error codes: one for the driver API and one for the runtime API. Since newer error entries were missing from these lists, I consulted the official documentation for the driver's CUresult enum and for the runtime's cudaError enum. To my surprise, the two use compatible error codes (apart from a few extra codes unique to each). The runtime error list in the code base I'm updating, however, is radically different from the driver list.

To probe further, I picked code 30 from the current runtime list, where it is treated as "Unsupported CUDA Card". But 30 is not a valid value in either of the official error code lists. However, googling "cuda runtime error 30" turns up lots of forum posts from people getting error code 30, often mentioning the term "unknown error" (perhaps because they passed it to cudaGetErrorName, or maybe even cuGetErrorName, and got that back?).

So although this code base uses 30 and the official lists don't mention anything like it, the value does seem somewhat legitimate.

What am I missing here?

Carl Colijn
  • Nothing. They are enums, which are supposed to be used by name, not by value. That is the whole point of them. Expecting the values to remain the same in perpetuity is naive, to put it mildly. See a [prophetic warning from 2012](https://stackoverflow.com/a/13041774/681865) regarding a CUBLAS equivalent as an example of exactly why this is such a poor idea – talonmies Jan 19 '21 at 12:36
  • True indeed; I'd expect values to get deprecated or added and such, but not such a radical change in enum values. I'll see if I can dig up an ancient documentation page on this enum. But in this case I must map the CUDA errors to a range in a bigger internal unified error code list (there are lots of other APIs used, all with their own errors too), and that system is int based, not text based... – Carl Colijn Jan 19 '21 at 12:41
  • @talonmies I checked the 8.0 runtime documentation (the oldest one still online at NVIDIA's website), and in that list code 30 does appear, as "cudaErrorUnknown", instead of the value 999 it has now. And the list is indeed contiguous from 1 to about 80, also quite unlike what it is today. So there you go. If you add an answer instead of a comment, I'd gladly mark it as accepted. – Carl Colijn Jan 19 '21 at 12:52



All the error codes in the CUDA runtime API, driver API, and application APIs (CUBLAS, for example) are documented and supplied as C/C++ enum types in header files. The design is intended for programmers to use these enumerations by name, not by value. Using them by value would leave the naïve or foolhardy programmer open to the possibility that the values could change if NVIDIA decided to renumber the enumerations as part of an overhaul or refactor of the APIs. This has happened several times over the 12-odd years of CUDA development, as you have discovered.

While it is more typing to use enumeration names rather than values, it is future-proof. The alternative isn't much fun. This isn't the first time I have counselled against designing any code you care about around values, because it can and probably will break at some point in the future.

Hopefully this answer will contribute to it being the last.

talonmies
  • Amen to that last sentence :) I've already devised a way to map this code base's internal `fixed int -> internal enum -> text` system to a `relative int + manufacturer code -> internal enum -> text` system, so all will be good there too from now on. – Carl Colijn Jan 19 '21 at 13:18