
First and foremost: I am completely unable to create an MCVE, as I can only reproduce this when running the full code; any attempt to measure or replicate the error in a simpler environment makes it disappear. TL;DR: I suspect it is not a code problem but a configuration problem.


I have a piece of code that does some mathematics in CUDA kernels. I have a Windows machine (Win10 x64, GTX 1050, CUDA 9.2) and a Linux machine (Ubuntu 17.04, 2× GTX 1080 Ti, CUDA 9.1).

My code runs fine on the Windows machine. The kernels are long-running (~700 ms per kernel call for big samples), so I needed to increase the TDR value in Windows. The code also (for now) forces everything to run on one GPU, the first one, selected with cudaSetDevice(0).

When I copy the same input data and code to the Linux machine (I am using git, so it is exactly the same code), I get either

 an illegal memory access was encountered

or

 unspecified launch failure

in my error checking after the GPU call.
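
For reference, the error checking follows the standard launch-then-check pattern, roughly like this (a minimal sketch; the kernel body and names are placeholders, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real math kernel.
__global__ void my_kernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    cudaSetDevice(0);  // force the first GPU, as in my code
    float* d_out;
    cudaMalloc((void**)&d_out, 256 * sizeof(float));

    my_kernel<<<1, 256>>>(d_out);

    cudaError_t err = cudaGetLastError();   // catches launch-time errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // catches asynchronous execution errors
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```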

If I change the kernel so that, instead of doing the math, it just writes a number to the output, it executes properly. Other CUDA code (different functions that I have) works fine too. All this leads me to think that the problem lies outside the code: not in the code itself, nor in the general configuration of the drivers/environment variables.

I read that xorg.conf can have an effect on the timeout of the kernels. I generated an xorg.conf (I had none) and removed the devices from it, as suggested here. I am connecting to the server remotely and have no monitor plugged in. This changes nothing in the behavior; my kernels still fail with the same errors.

My question is: what else should I look at? What Linux-specific configuration should I examine to pinpoint the cause of these kernel failures?

Ander Biguri
  • First and foremost, if your statement is this: "I am completely unable to create a MCVE", then your question is probably not suitable for SO. SO makes this clear and uses the word **must** [here](https://stackoverflow.com/help/on-topic) in item 1. You are clearly seeking debugging help. If it were simply permissible for people to skip this, and/or to write questions in such a way as to avoid this requirement, then everybody could do that, and there would be little point in spelling out such a requirement. – Robert Crovella Oct 30 '18 at 15:42
  • Now, to your question, the first 2 things I would do are: 1. Verify, using `deviceQuery` on the linux machine, that CUDA believes that your kernels are not runtime-limited. Check **both** GPUs (a programmatic version of this check is sketched after this thread). If you are using 1 GPU for display on linux, it has runtime limited kernels, and may also be the one selected first for CUDA. 2. Use the method described [here](https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218) to start the debug process on the `an illegal memory access was encountered` error – Robert Crovella Oct 30 '18 at 15:44
  • @RobertCrovella Yup, I am completely aware, and I will delete this if that is the reception it gets. Now, there are a few questions on SO without an MCVE of this same exact problem on Windows where the answer is just "increase the TDR time because your kernel is probably taking too long". I am quite sure it is not the code itself either, as I have been using it for quite a while now on a different machine with a 100% success rate, so even if I shared the entire thousands of lines of code, that may not even be an MCVE, as it may not reproduce the error on a different machine... – Ander Biguri Oct 30 '18 at 15:48
  • If you discover that one or both of your GPUs are runtime limited on linux, I would review [this document](https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x) and make it so that at least one of the 2 GPUs is not runtime limited. Then use the CUDA_VISIBLE_DEVICES [env variable](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars) on linux to "hide" the display GPU and force the CUDA code to run on the GPU that is not runtime limited. – Robert Crovella Oct 30 '18 at 15:49
  • @RobertCrovella Thanks, I will look at those. Will the second method report useful information if the error is generated by something other than the code, e.g. a timeout? – Ander Biguri Oct 30 '18 at 15:49
  • No, I would not use the second method if you suspect a timeout error. The first thing I would do is to make sure that it could not possibly be a timeout error, by following the steps I've already outlined. **After** that, when you have ruled out the possibility of a timeout error on linux, then begin typical debug. – Robert Crovella Oct 30 '18 at 15:51
  • @RobertCrovella Apparently the devices are not runtime limited. – Ander Biguri Oct 30 '18 at 16:25
  • Then if it were me, I would pursue typical debug. If `cuda-memcheck` is reporting an illegal access or unspecified launch failure, then, if it were me, I would start with extracting some more information from `cuda-memcheck` using the method I already linked to. Beyond that, I can't really say anything about your code. Yes, I acknowledge your claims that there could not possibly be anything wrong with the code. But `cuda-memcheck` thinks otherwise. I would grab the tiger by the tail, and see where it leads you. – Robert Crovella Oct 30 '18 at 16:48
  • @RobertCrovella figured it out. I'd say this is the epitome of *too specific* – Ander Biguri Oct 31 '18 at 12:02
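
A minimal sketch of the runtime-limit check suggested in the comments, i.e. a programmatic equivalent of reading the `deviceQuery` output (illustrative only):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // kernelExecTimeoutEnabled == 1 means the display watchdog
        // can kill long-running kernels on this device.
        printf("GPU %d (%s): run time limit on kernels: %s\n",
               d, prop.name, prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    }
    return 0;
}
```

If one GPU does turn out to be runtime limited, something like `CUDA_VISIBLE_DEVICES=1 ./app` hides device 0 from the application so the CUDA code lands on the unrestricted card. Once timeouts are ruled out, compiling with `nvcc -lineinfo` and running under `cuda-memcheck`, per the comments, is the usual way to localize the faulting access.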

1 Answer


The error did indeed end up being an illegal memory access.

It was caused by the fact that sizeof(unsigned long) is platform specific: my Linux machine returns 8, while my Windows machine returns 4. As this code is called from MATLAB, and MATLAB (like some other high-level languages such as Python) defines the sizes of its variables in bits (e.g. uint32(1)), there was a size mismatch on the Linux machine when doing memcpys. It turns out this happened to a variable used as an index, so the kernels were reading garbage (due to the bad memcpy) and then trying to access another array at that location, producing the illegal memory access.
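
A minimal sketch of the mismatch (hypothetical values; in the real code the 4-byte buffer comes from MATLAB through the MEX interface):

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // On Win64, sizeof(unsigned long) == 4; on Linux x64 it is 8.
    printf("sizeof(unsigned long) = %zu\n", sizeof(unsigned long));

    // An index laid out as MATLAB defines it: a 4-byte uint32,
    // followed by unrelated data.
    uint32_t from_matlab[2] = {42u, 0xDEADBEEFu};

    // BUG on Linux: this copies 8 bytes, so the neighbouring value
    // ends up in the high word of the index. On Windows it copies
    // 4 bytes and happens to work.
    unsigned long idx = 0;
    memcpy(&idx, from_matlab, sizeof(unsigned long));

    // Fix: a fixed-width type always matches the 4 bytes MATLAB sent.
    uint32_t idx_fixed = 0;
    memcpy(&idx_fixed, from_matlab, sizeof(uint32_t));

    printf("buggy idx = %lu, fixed idx = %u\n", idx, idx_fixed);
    return 0;
}
```

With the garbage high word, the index is astronomically large, so the kernel that uses it to address another array walks off the allocation, which is exactly the `an illegal memory access was encountered` error above.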

Too specific? Yeah.

Ander Biguri