My understanding is that in the above, execution errors occurring during the asynchronous execution of either kernel may be returned by cudaGetLastError()
.
That is correct. The runtime API will return the last error which was encountered. It isn't possible to know from which call in a sequence of asynchronous API calls an error was generated.
In that case, how do I figure out which kernel caused the error to occur during runtime?
You can't. You would require some kind of additional API call between the two kernel launches to determine the error. The crudest would be a cudaDeviceSynchronize()
call, although that would serialize the operations if they actually did overlap (although I see no stream usage so that is probably not happening here).
As noted in comments -- most kernel runtime errors will result in context destruction, so if you got an error from the first kernel, the second kernel will abort or refuse to run anyway and that is probably fatal to your whole application.