
I am trying to set up GPUDirect so that InfiniBand verbs RDMA calls can operate directly on device memory, without going through cudaMemcpy. I have two machines with NVIDIA K80 GPU cards, each with driver version 367.27. CUDA 8 and Mellanox OFED 3.4 are installed, and the Mellanox-NVIDIA GPUDirect plugin (nv_peer_mem) is installed as well:

-bash-4.2$ service nv_peer_mem status
nv_peer_mem module is loaded.

According to the thread "How to use GPUDirect RDMA with Infiniband", I have all the requirements for GPUDirect, and the following code should run successfully. But it does not: ibv_reg_mr fails with the error "Bad address", as if GPUDirect were not properly installed.

void *gpu_buffer;
struct ibv_mr *mr;
const int size = 64*1024;
cudaMalloc(&gpu_buffer, size); // TODO: Check errors
// pd is a protection domain allocated earlier with ibv_alloc_pd()
mr = ibv_reg_mr(pd, gpu_buffer, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
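
For reference, here is a self-contained sketch of the failing path with explicit error checking (the device selection and protection-domain setup are simplified assumptions, not the exact code around the snippet above). It links against libibverbs and the CUDA runtime (-libverbs -lcudart):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Open the first available RDMA device and allocate a protection domain. */
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        return 1;
    }

    /* Allocate device memory and check the CUDA error code explicitly. */
    void *gpu_buffer = NULL;
    const size_t size = 64 * 1024;
    cudaError_t cerr = cudaMalloc(&gpu_buffer, size);
    if (cerr != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cerr));
        return 1;
    }

    /* Register the GPU buffer: with a working GPUDirect RDMA stack this
       succeeds; otherwise it returns NULL with errno EFAULT ("Bad address"). */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buffer, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        fprintf(stderr, "ibv_reg_mr failed: %s (errno %d)\n", strerror(errno), errno);
    } else {
        printf("ibv_reg_mr succeeded, lkey=0x%x rkey=0x%x\n",
               (unsigned)mr->lkey, (unsigned)mr->rkey);
        ibv_dereg_mr(mr);
    }

    cudaFree(gpu_buffer);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}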

Requested Info:
mlx5 is used.
Last kernel log entry (error -14 corresponds to -EFAULT, matching the "Bad address" returned by ibv_reg_mr):

[Nov14 09:49] mlx5_warn:mlx5_0:mlx5_ib_reg_user_mr:1418:(pid 4430): umem get failed (-14)

Am I missing something? Do I need some other packages, or do I have to activate GPUDirect in my code somehow?

– kusterl
  • Are you using the `mlx4` or the `mlx5` driver? Do you see any other related errors or warnings in the kernel log? – haggai_e Nov 13 '16 at 06:43
  • Just to be sure, can you check that the `cudaMalloc()` call didn't fail? – haggai_e Nov 16 '16 at 08:19
  • cudaMalloc did not fail. – kusterl Nov 16 '16 at 09:23
  • Note: I tested the same code on another machine. There it works fine. – kusterl Nov 16 '16 at 09:24
  • Interesting. Is there any notable difference between the two machines? – haggai_e Nov 16 '16 at 09:26
  • I just noticed that on the new machine, service nv_peer_mem status tells me "nv_peer_mem stop/waiting". So I thought maybe it is not properly started on the first machine. I therefore ran "service nv_peer_mem start", which tells me "Starting... OK", but the status does not change and the program still fails. – kusterl Nov 16 '16 at 10:17
  • Other differences: the new machine has NVIDIA driver 352.39, OFED 3.2 and CUDA 7.5. – kusterl Nov 16 '16 at 10:27

1 Answer


A common reason for nv_peer_mem module failures is interaction with Unified Memory (UVM). Could you try disabling UVM:

export CUDA_DISABLE_UNIFIED_MEMORY=1

before running your program?
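
If exporting the variable in the shell is inconvenient, the same effect can presumably be achieved from inside the program, assuming the variable is consulted when the CUDA runtime initializes (so it must be set before the first CUDA call). A minimal sketch:

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Assumption: the variable is read at CUDA initialization, so it has to
       be set before the first CUDA runtime call in this process. */
    setenv("CUDA_DISABLE_UNIFIED_MEMORY", "1", /*overwrite=*/1);

    void *gpu_buffer = NULL;
    cudaError_t err = cudaMalloc(&gpu_buffer, 64 * 1024);
    /* ... register gpu_buffer with ibv_reg_mr() as in the question ... */
    return err == cudaSuccess ? 0 : 1;
}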

If this does not fix your problem, try running the validation and copybw tests from https://github.com/NVIDIA/gdrcopy to check that GPUDirect RDMA itself works. If they pass, then your Mellanox stack is misconfigured.

– ptrendx