
I am trying to build a cluster from two Ubuntu servers. I installed Open MPI by running:

sudo apt install libopenmpi-dev

I can ssh between both servers without a password, and I have created a shared NFS directory between them. The issue appears when I try to run a simple command to check whether MPI is working. For example, on my master node, if I run:

   
   mpirun -np 2 hostname
   

or just

   mpirun

the command hangs indefinitely without any error message.

I read it might come from my firewall so I disabled it:


   sudo ufw status
   Status: inactive

but still the problem remains.

I tried the solution proposed in another post and ran:

   strace -f -- mpirun -np 1 localhost

The program hangs at:

   _flags=0}, 0) = 936
   recvmsg(9, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1668671375, pid=1310460}, 0}, iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
   close(9) = 0
   socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
   connect(9, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
   getsockname(9, {sa_family=AF_INET, sin_port=htons(52429), sin_addr=inet_addr("127.0.0.1")}, [28->16]) = 0
   close(9) = 0
   socket(AF_INET6, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
   connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
   getsockname(9, {sa_family=AF_INET6, sin6_port=htons(42893), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [28]) = 0
   close(9) = 0
   socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 9
   setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   setsockopt(9, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
   connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
   getpeername(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [124->28]) = 0
   uname({sysname="Linux", nodename="dcilda1872", ...}) = 0
   access("/home/e177338/.Xauthority", R_OK) = 0
   openat(AT_FDCWD, "/home/e177338/.Xauthority", O_RDONLY) = 10
   fstat(10, {st_mode=S_IFREG|0600, st_size=1120, ...}) = 0
   read(10, "\1\0\0\ndcilda1872\0\00213\0\22MIT-MAGIC-CO"..., 4096) = 1120
   read(10, "", 4096) = 0
   close(10) = 0
   fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
   fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   fcntl(9, F_SETFD, FD_CLOEXEC) = 0
   poll([{fd=9, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=9, revents=POLLOUT}])
   writev(9, [{iov_base="l\0\v\0\0\0\0\0\0\0\0\0", iov_len=12}, {iov_base="", iov_len=0}], 2) = 12
   recvfrom(9, 0x558bd3c91e30, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
   poll([{fd=9, events=POLLIN}], 1, -1
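A note on reading this trace: the socket that the final `poll` waits on (fd 9) was connected to port 6006 on `::1`, and `.Xauthority` is read right after that connect. X11 servers conventionally listen on TCP port 6000 plus the display number, so this may be an X11 connection rather than MPI traffic. A minimal sketch of that mapping follows; `x11_display_from_port` is a hypothetical helper name, and the 6000-6063 range is an assumption about the usual convention, not something taken from the trace.

```shell
#!/bin/sh
# Hypothetical helper: map a TCP port to the X11 display it would
# conventionally belong to (port = 6000 + display number).
# The 6000-6063 range is an assumed convention.
x11_display_from_port() {
    port=$1
    if [ "$port" -ge 6000 ] && [ "$port" -le 6063 ]; then
        echo ":$((port - 6000))"
    else
        # Not in the conventional X11 port range.
        return 1
    fi
}

# Port seen in the connect() calls of the trace.
x11_display_from_port 6006   # prints :6
```

Comparing the result with the output of `echo "$DISPLAY"` on the master node would show whether the session points at that display.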

Would you have any idea?

Thanks in advance :)

yoshcn
  • Run `sudo iptables -L` to confirm no firewall is running. How many network interfaces does each machine have? – Gilles Gouaillardet Nov 17 '22 at 08:22
  • I checked; no firewall is running. The network interfaces are: Machine 1 (master node, where mpirun fails to run): bond0, br_c****6afb107, br-****, docker0, ens10***, ens10***, lo, and 5 veth***. Machine 2 (worker node): br-1c2****, docker0, eno1, lo, and only 1 veth***. – yoshcn Nov 17 '22 at 11:45
  • Assuming `bond0` is the network that connects both nodes, try `mpirun --mca oob_tcp_if_include bond0 --mca btl_tcp_if_include bond0 ...` – Gilles Gouaillardet Nov 17 '22 at 11:47
  • I observed that if I connect to the second server (worker node) and run `mpirun -np 2 hostname`, the command executes without issue. The problem seems to come from the first server (master node). – yoshcn Nov 17 '22 at 11:51
  • I ran the command `mpirun --mca oob_tcp_if_include bond0 --mca btl_tcp_if_include bond0` with different network interfaces. It still hangs. I ran `strace -f -- mpirun -np 1 localhost` and saw a connection refused error: `connect(9, {sa_family=AF_INET6, sin6_port=htons(6004), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused) close(9) = 0`. Is it relevant? – yoshcn Nov 17 '22 at 11:57
  • That could be related to X11. Try `unset DISPLAY` and see how it goes. – Gilles Gouaillardet Nov 17 '22 at 12:20
  • Unfortunately, that did not work either. I found a workaround: I built a Docker container on my first server and then ssh'd to my second server from the container. MPI seems to work there. Thanks for your time :) – yoshcn Nov 20 '22 at 15:49
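
The two suggestions from the comments above (restricting Open MPI to one interface and unsetting `DISPLAY`) can be combined in a short shell sketch. It assumes, as the comments do, that `bond0` is the interface connecting both nodes; `mpirun_on_iface` is a hypothetical helper name. The helper only prints the command line so it can be reviewed before being run.

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun command line that restricts both
# the out-of-band (oob) and TCP byte-transfer (btl) layers to one interface.
mpirun_on_iface() {
    iface=$1
    shift
    echo "mpirun --mca oob_tcp_if_include $iface --mca btl_tcp_if_include $iface $*"
}

# Rule out X11 involvement for anything started from this shell.
unset DISPLAY

# Assumption: bond0 is the network that links the two nodes.
mpirun_on_iface bond0 -np 2 hostname
```

Once the printed command looks right, it can be pasted into the same shell, where `DISPLAY` is already unset. According to the thread, neither step resolved the hang on the master node, so this is a way to repeat the checks rather than a fix.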
