
I am trying to set up GPUDirect so that InfiniBand verbs RDMA calls can operate directly on device memory, without going through cudaMemcpy. I have two machines with NVIDIA K80 GPU cards, each with driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed, and the Mellanox-NVIDIA GPUDirect plugin (nv_peer_mem) is installed as well:

-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.

According to the thread "How to use GPUDirect RDMA with Infiniband", I have all the requirements for GPUDirect, and the following code should run successfully. But it does not: ibv_reg_mr fails with the error "Bad address", as if GPUDirect were not properly installed.

void *gpu_buffer;
struct ibv_mr *mr;
const int size = 64*1024;
cudaMalloc(&gpu_buffer, size); // TODO: Check errors
// pd is a protection domain allocated earlier with ibv_alloc_pd()
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
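
For reference, here is a self-contained sketch of the failing path with explicit error checking (the device selection and protection-domain setup are simplified assumptions, not the exact code around the snippet above). It links against libibverbs and the CUDA runtime (-libverbs -lcudart):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Open the first available RDMA device and allocate a protection domain. */
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        return 1;
    }

    /* Allocate device memory and check the CUDA error code explicitly. */
    void *gpu_buffer = NULL;
    const size_t size = 64 * 1024;
    cudaError_t cerr = cudaMalloc(&gpu_buffer, size);
    if (cerr != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cerr));
        return 1;
    }

    /* Register the GPU buffer: with a working GPUDirect RDMA stack this
       succeeds; otherwise it returns NULL with errno EFAULT ("Bad address"). */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buffer, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr failed: %s (errno %d)\n", strerror(errno), errno);
    } else {
        printf("ibv_reg_mr succeeded, lkey=0x%x rkey=0x%x\n",
               (unsigned)mr->lkey, (unsigned)mr->rkey);
        ibv_dereg_mr(mr);
    }

    cudaFree(gpu_buffer);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}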

Requested Info:
mlx5 is used.
Last kernel log entry (error -14 corresponds to -EFAULT, matching the "Bad address" returned by ibv_reg_mr):

[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)

Am I missing something? Do I need some other packages, or do I have to activate GPUDirect in my code somehow?

– kusterl
  • Are you using the `mlx4` or the `mlx5` driver? Do you see any other related errors or warnings in the kernel log? – haggai_e Nov 13 '16 at 06:43
  • Just to be sure, can you check that the `cudaMalloc()` call didn't fail? – haggai_e Nov 16 '16 at 08:19
  • cudaMalloc did not fail. – kusterl Nov 16 '16 at 09:23
  • Note: I tested the same code on another machine. There it works fine. – kusterl Nov 16 '16 at 09:24
  • Interesting. Is there any notable difference between the two machines? – haggai_e Nov 16 '16 at 09:26
  • I just noticed that on the new machine, service nv_peer_mem status tells me "nv_peer_mem stop/waiting". So I thought maybe it is not properly started on the first machine. I therefore ran "service nv_peer_mem start", which tells me "Starting... OK", but the status does not change and the program still fails. – kusterl Nov 16 '16 at 10:17
  • Other differences: the new machine has NVIDIA driver 352.39, OFED 3.2 and CUDA 7.5. – kusterl Nov 16 '16 at 10:27

1 Answer


A common reason for nv_peer_mem module failures is interaction with Unified Memory (UVM). Could you try disabling UVM:

export CUDA_DISABLE_UNIFIED_MEMORY=1

before running your program?
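
If exporting the variable in the shell is inconvenient, the same effect can presumably be achieved from inside the program, assuming the variable is consulted when the CUDA runtime initializes (so it must be set before the first CUDA call). A minimal sketch:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Assumption: the variable is read at CUDA initialization, so it has to
       be set before the first CUDA runtime call in this process. */
    setenv("CUDA_DISABLE_UNIFIED_MEMORY", "1", /*overwrite=*/1);

    void *gpu_buffer = NULL;
    cudaError_t err = cudaMalloc(&gpu_buffer, 64 * 1024);
    /* ... register gpu_buffer with ibv_reg_mr() as in the question ... */
    return err == cudaSuccess ? 0 : 1;
}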

If this does not fix your problem, try running the validation and copybw tests from https://github.com/NVIDIA/gdrcopy to check that GPUDirect RDMA itself works. If they pass, then your Mellanox stack is misconfigured.

– ptrendx