The MMU fault you are referring to is presumably an Xid 31 error as described here.
The most common cause, in my experience, is a defect in CUDA device code (i.e. a GPU kernel written by the CUDA user) that triggers an error while the kernel is executing. Such issues can nearly always be captured and localized using cuda-memcheck. (You can also use a debugger, as described in the link above.)
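To make the "device code defect" case concrete, here is a minimal sketch of a kernel together with the host-side error checking that surfaces a failed kernel. It is an illustration only, not code from the question: the `CUDA_CHECK` macro, the `fill` kernel, and the `run_fill` helper are all hypothetical names. The `if (idx < n)` guard is the important line; removing it lets the threads in the last block write past the end of the allocation, which is the kind of out-of-bounds access cuda-memcheck is designed to report.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line on any CUDA runtime error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Each thread writes one element. The grid is rounded up to a whole number
// of blocks, so some threads have idx >= n and must not touch memory.
__global__ void fill(int *data, int n, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {          // removing this guard causes out-of-bounds writes
        data[idx] = value;
    }
}

// Launches the kernel and returns how many elements do not hold `value`.
int run_fill(int n, int value)
{
    int *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(int)));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fill<<<blocks, threads>>>(d_data, n, value);
    CUDA_CHECK(cudaGetLastError());        // launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised during execution

    std::vector<int> h_data(n);
    CUDA_CHECK(cudaMemcpy(h_data.data(), d_data, n * sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));

    int mismatches = 0;
    for (int v : h_data) {
        if (v != value) ++mismatches;
    }
    return mismatches;
}

int main()
{
    int mismatches = run_fill(1000, 7);
    std::printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

Checking both `cudaGetLastError()` after the launch and the return value of `cudaDeviceSynchronize()` matters because kernel launches are asynchronous: a fault inside the kernel is only reported by a later runtime call.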
For these cases, the best way to begin debugging, IMO, is the method described here, which is essentially what the document I linked above refers to. Using that method, cuda-memcheck is generally able to localize the error to a specific line of source code for you. From there you have additional debug avenues you can pursue, such as in-kernel printf and/or a debugger, as described.
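As a rough sketch of the in-kernel printf avenue (the linked method itself is not reproduced here), the kernel below validates an index before dereferencing it and prints the offending thread and value instead of faulting. The `gather` kernel, the `check` helper, and `count_bad_indices` are hypothetical names chosen for illustration. I am also assuming a build along the lines of `nvcc -lineinfo -o gather_demo gather_demo.cu` followed by `cuda-memcheck ./gather_demo`; `-lineinfo` is the nvcc option that embeds source line information in device code, and the file names are made up.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// dst[i] = src[indices[i]], but an out-of-range index is reported with
// printf and counted instead of being dereferenced.
__global__ void gather(const float *src, int src_len, const int *indices,
                       int n, float *dst, int *bad_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int j = indices[i];
    if (j < 0 || j >= src_len) {
        printf("gather: block %d thread %d read out-of-range index %d\n",
               blockIdx.x, threadIdx.x, j);
        atomicAdd(bad_count, 1);
        dst[i] = 0.0f;
        return;
    }
    dst[i] = src[j];
}

// Runs the kernel on the given indices and returns how many were rejected.
int count_bad_indices(const std::vector<int> &indices, int src_len)
{
    const int n = static_cast<int>(indices.size());
    std::vector<float> src(src_len, 1.0f);

    float *d_src = nullptr, *d_dst = nullptr;
    int *d_idx = nullptr, *d_bad = nullptr;
    check(cudaMalloc(&d_src, src_len * sizeof(float)), "cudaMalloc(src)");
    check(cudaMalloc(&d_dst, n * sizeof(float)), "cudaMalloc(dst)");
    check(cudaMalloc(&d_idx, n * sizeof(int)), "cudaMalloc(indices)");
    check(cudaMalloc(&d_bad, sizeof(int)), "cudaMalloc(bad_count)");

    check(cudaMemcpy(d_src, src.data(), src_len * sizeof(float),
                     cudaMemcpyHostToDevice), "copy src");
    check(cudaMemcpy(d_idx, indices.data(), n * sizeof(int),
                     cudaMemcpyHostToDevice), "copy indices");
    check(cudaMemset(d_bad, 0, sizeof(int)), "cudaMemset(bad_count)");

    const int threads = 128;
    gather<<<(n + threads - 1) / threads, threads>>>(d_src, src_len, d_idx,
                                                     n, d_dst, d_bad);
    check(cudaGetLastError(), "kernel launch");
    // Synchronizing reports execution errors and flushes device printf output.
    check(cudaDeviceSynchronize(), "kernel execution");

    int bad = 0;
    check(cudaMemcpy(&bad, d_bad, sizeof(int), cudaMemcpyDeviceToHost),
          "copy bad_count");
    cudaFree(d_src);
    cudaFree(d_dst);
    cudaFree(d_idx);
    cudaFree(d_bad);
    return bad;
}

int main()
{
    // Indices 5 and -1 are outside a source array of length 4.
    int bad = count_bad_indices({0, 1, 5, -1}, 4);
    std::printf("rejected indices: %d\n", bad);
    return EXIT_SUCCESS;
}
```

Once cuda-memcheck has pointed at a source line, a temporary check like this around the suspect access is one way to find out which thread and which input value are responsible.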
If cuda-memcheck does not report any issues, but an Xid 31 error is logged in your system logs each time you run a particular app, then, as indicated in the first linked document, this is not really end-user debuggable (and should be a rare occurrence). The only recourse at that point is to file a bug at developer.nvidia.com, using the general method described here.