Debugging CUDA MMU Fault

Question

In my code I repeatedly get memory access errors, and I cannot find the reason why this would happen.

What is an MMU error on CUDA in the first place, and how can I debug where it's coming from? Currently it happens when defining a lambda function, but when I rewrite the code it happens somewhere else, so the behaviour seems quite undefined, and I don't know how to even start debugging this.

score 2 · Answer 1 · answered Jul 07 '20 at 09:33

The MMU fault you are referring to is presumably an Xid 31 error as described here.

In my experience, the most common reason for this is a defect in user-written CUDA code (i.e. GPU kernel/device code) that causes an error during the execution of a GPU kernel. Such issues are nearly always capturable/localizable using cuda-memcheck. (You can also use a debugger, as described in the link above.)

For these cases, the best way to begin debugging, IMO, is the method described here. It is essentially what the document I linked above refers to. Using that method, cuda-memcheck is generally able to localize the error to a specific line of source code for you. After that you have additional debug avenues you can pursue, using in-kernel printf and/or a debugger, as described.
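To make this concrete, here is a minimal sketch of the kind of device-code defect that cuda-memcheck can usually pin down. It is not taken from the question or from the linked answer: the kernel `fill`, the helper `blocks_for`, and the `CUDA_CHECK` macro are hypothetical names made up for illustration, and the missing bounds check is deliberate.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line when a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements (rounds up).
inline int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Deliberately buggy kernel: there is no `if (i < n)` guard, so the extra
// threads in the last block write past the end of the allocation.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = value;  // BUG: out of bounds whenever i >= n
}

int main() {
    const int n = 1000;       // not a multiple of the block size
    const int threads = 256;  // 4 blocks -> 1024 threads, 24 of them overrun

    int *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)));

    fill<<<blocks_for(n, threads), threads>>>(d, n, 7);
    CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());  // surfaces errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

Assuming the file is saved as oob.cu, building with `nvcc -lineinfo -o oob oob.cu` and running `cuda-memcheck ./oob` should let the tool attribute the invalid write to the `data[i] = value;` line. An overrun this small may or may not fault in a normal run, which is part of why such bugs appear to move around when the code is rewritten. On newer CUDA toolkits, cuda-memcheck has been superseded by compute-sanitizer, which is invoked the same way.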

If cuda-memcheck does not report any issues, but the Xid 31 error is logged in your system logs each time you run a particular app, then, as indicated in the first linked document, this is not really end-user debuggable (and should be a rare occurrence). The only recourse at that point is to file a bug at developer.nvidia.com, using the general method described here.

This question is about someone's own code. If it was about running a particular app, it would be off-topic on this website. — user253751, Jul 07 '20 at 11:36
