I am trying to build a cluster from 2 Ubuntu servers. I installed MPI by running:
sudo apt install libopenmpi-dev
I can ssh between both servers without a password and have set up a shared NFS directory between them. The problem appears when I try to run a simple command to check that MPI is working. For example, on my master node, if I run:
mpirun -np 2 hostname
or just
mpirun
the command hangs indefinitely without any error message.
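To make the hang easier to reproduce and report, the command can be wrapped in a time limit instead of being left to block forever. This is a minimal sketch, assuming GNU coreutils `timeout` is installed; `check_hang` is a hypothetical helper name of my own, not part of Open MPI:

```shell
# Hypothetical helper: run a command with a time limit (in seconds)
# and report whether it finished or had to be killed.
check_hang() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    # GNU timeout exits with status 124 when the time limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*"
    else
        echo "exit status $status: $*"
    fi
    return "$status"
}

# Example call (not executed here):
#   check_hang 10 mpirun -np 2 hostname
```

A result of "timed out" confirms that the command is blocked rather than failing quietly.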
I read that the firewall might be the cause, so I disabled it:
sudo ufw status
Status: inactive
but the problem remains.
I used the solution proposed here sol
and ran:
strace -f -- mpirun -np 1 localhost
The program hangs at:
_flags=0}, 0) = 936
recvmsg(9, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1668671375, pid=1310460}, 0}, iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
close(9) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getsockname(9, {sa_family=AF_INET, sin_port=htons(52429), sin_addr=inet_addr("127.0.0.1")}, [28->16]) = 0
close(9) = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_IP) = 9
connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
getsockname(9, {sa_family=AF_INET6, sin6_port=htons(42893), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [28]) = 0
close(9) = 0
socket(AF_INET6, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 9
setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(9, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = 0
getpeername(9, {sa_family=AF_INET6, sin6_port=htons(6006), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, [124->28]) = 0
uname({sysname="Linux", nodename="dcilda1872", ...}) = 0
access("/home/e177338/.Xauthority", R_OK) = 0
openat(AT_FDCWD, "/home/e177338/.Xauthority", O_RDONLY) = 10
fstat(10, {st_mode=S_IFREG|0600, st_size=1120, ...}) = 0
read(10, "\1\0\0\ndcilda1872\0\00213\0\22MIT-MAGIC-CO"..., 4096) = 1120
read(10, "", 4096) = 0
close(10) = 0
fcntl(9, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(9, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl(9, F_SETFD, FD_CLOEXEC) = 0
poll([{fd=9, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=9, revents=POLLOUT}])
writev(9, [{iov_base="l\0\v\0\0\0\0\0\0\0\0\0", iov_len=12}, {iov_base="", iov_len=0}], 2) = 12
recvfrom(9, 0x558bd3c91e30, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=9, events=POLLIN}], 1, -1
Would you have any idea what is causing this?
Thanks in advance :)