I'm trying to run multilocale Chapel code on a cluster that has an MXM InfiniBand network (40 Gbps, model: Mellanox Technologies MT26428).

I followed both the Chapel and GASNet documentation and set

export CHPL_COMM_SUBSTRATE=ibv

export CHPL_LAUNCHER=gasnetrun_ibv

export GASNET_IBV_SPAWNER=mpi

instead of using CHPL_COMM_SUBSTRATE=mxm, since mxm is deprecated.

I can build Chapel with the ibv substrate, but I cannot run on multiple locales: I receive a huge number of timeout errors.

At first, I thought the problem was the PKEY, so I added `--mca btl_openib_pkey "0x8100"` to MPIRUN_CMD, without success.

I also tried to use the deprecated mxm configuration:

export CHPL_COMM_SUBSTRATE=mxm

export CHPL_LAUNCHER=gasnetrun_mxm

export GASNET_MXM_SPAWNER=mpi

However, I cannot build Chapel with this configuration. This is the error message:

"User requested --enable-mxm, but I don't know how to build mxm programs for your system."

By the way, using GASNet on top of MPI, UDP, and InfiniBand without a partition key works just fine.

Does anybody know how to use Chapel on a cluster equipped with an MXM InfiniBand network and a partition key (PKEY)?

Best Regards,

Tiago Carneiro.

  • In your initial setup that runs, but just times out, what output do you get when running with the --verbose flag (which should show the underlying commands used to launch the program)? – Brad Dec 08 '18 at 15:59
  • It sounds to me like you need to rebuild Chapel with support for IB and make sure that all the required libraries are in the correct locations during the initial install or build (I don't use Chapel, so I know little about the build process). A quick Google search leads me to believe you need a correctly installed OFED implementation; I would recommend Mellanox's unless there is a reason not to use it. Make sure you have the full OFED installed, since it might include the legacy packages you need. – Matt Dec 08 '18 at 19:11
  • Hello Brad. Using SSH as the IBV spawner, then building and compiling the distributed hello world, I get: ``*** FATAL ERROR: failed to connect (snd) status=12``. In turn, using MPI as the spawner and without passing the PKEY with --mca, I get ``too many retries sending message to 0x0027:0x00003a15, giving up``. Finally, passing the correct PKEY to MPI, I get ``mpirun noticed that process rank 8 with PID 19647 on node uvb-20 exited on signal 6 (Aborted).``. What makes me curious is that ``make`` says it cannot build for MXM. Is this configuration missing in Chapel? – Tiago Carneiro Dec 09 '18 at 23:14
  • Hello Matt. I did that. I wrote a script for the MXM network, then rebuilt Chapel and the distributed example. The problem is that I receive the following message: ``"User requested --enable-mxm, but I don't know how to build mxm programs for your system."``. I do not understand why. Is this configuration missing from Chapel's third-party code? I took a look, and the number of files is the same. I'll also try to download the OFED software; I just don't know whether I can install it on a cluster node. – Tiago Carneiro Dec 09 '18 at 23:23
  • Hi Tiago, we don't have much experience using MXM or PKEYs on my team. The message about "I don't know how to build mxm programs for your system" is almost certainly coming from GASNet rather than from Chapel ('git grep'ping the sources shows hits in the gasnet sources), and I suspect that the other issues you're running into are ones you'd hit trying to get any GASNet program to run on the system. Have you had luck building a GASNet-only hello world program for this system (using either a standalone version or the one bundled with Chapel)? I've asked the GASNet team for ideas as well. – Brad Dec 10 '18 at 21:20
  • Hello Brad! Thank you for your reply. I'll do what you suggested very soon and build a hello world from GASNet. I'll also try to build for other clusters, as I'm facing problems with other networks too. Thank you again. – Tiago Carneiro Dec 11 '18 at 12:20
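
For readers following Brad's first comment, a minimal sketch of a verbose launch is shown here. The binary name `./hello` and the locale count are hypothetical placeholders, not values from the thread; only the `--verbose` flag comes from the comment itself.

```shell
# "./hello" is a hypothetical name for a compiled multilocale Chapel program;
# override CHPL_BIN to point at your own executable.
CHPL_BIN="${CHPL_BIN:-./hello}"
NUM_LOCALES=2   # illustrative locale count

if [ -x "$CHPL_BIN" ]; then
  # -nl sets the number of locales; --verbose should show the underlying
  # commands the launcher uses to start the program.
  "$CHPL_BIN" -nl "$NUM_LOCALES" --verbose
else
  echo "no executable at $CHPL_BIN; compile the program first"
fi
```

The printed launcher command line is the part worth posting when asking for help, since it shows exactly what is passed to the spawner.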

1 Answer

Tiago,

As the author and maintainer of GASNet's ibv-conduit (support for libibverbs), I can tell you that we have never had support for a non-default PKey. The message `*** FATAL ERROR: failed to connect (snd) status=12` is consistent with use of the wrong PKey.

Based on your question here, I have made an attempt to provide support for a user-specified PKey. You can find my prototype as a pull-request in the GASNet git repository at Bitbucket: https://bitbucket.org/berkeleylab/gasnet/pull-requests/248 (or https://bitbucket.org/PHHargrove/gasnet-public/commits/ibv-pkey/raw to get just a raw patch). You should be able to apply the one commit in that PR in the third-party/gasnet/gasnet-src directory of the Chapel source. I don't have a partitioned IB network to test on. So, you would be helping me out if you can verify this resolves your problem.

Regarding `User requested --enable-mxm, but I don't know how to build mxm programs for your system`, I suspect that GASNet's configure probe was unable to find the necessary headers or libraries. Details of the failure should be in a `config.log` file below `third-party/gasnet/build`. If your mxm headers and libs are installed in a location other than `/opt/mellanox/mxm`, then you can set the environment variable `MXM_HOME` when building Chapel to inform GASNet's configure script of the actual location. However, I am not aware of any PKey support in libmxm, so this might be a dead end.
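
As a rough sketch of that check (not an official procedure), the fragment below sets `MXM_HOME` and looks for mxm-related lines in any `config.log` under `third-party/gasnet/build`. It assumes it is run from the top of the Chapel source tree; the fallback path is just the default location mentioned above, and the 20-line cutoff is arbitrary.

```shell
# Assumption: run from the top of the Chapel source tree.
# Point MXM_HOME at your real mxm install; the fallback is the default
# location named in the answer.
export MXM_HOME="${MXM_HOME:-/opt/mellanox/mxm}"

if [ -d "$MXM_HOME" ]; then
  echo "MXM_HOME=$MXM_HOME exists"
else
  echo "MXM_HOME=$MXM_HOME not found; the configure probe will likely fail"
fi

# Show the last mxm-related lines of each GASNet config.log, if any exist.
find third-party/gasnet/build -name config.log 2>/dev/null |
while read -r log; do
  echo "== $log"
  grep -n -i 'mxm' "$log" | tail -n 20
done
```

If the directory check fails, fixing `MXM_HOME` and rebuilding is the first thing to try before reading the log in detail.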

-Paul

  • Hello Paul, thank you for your reply! I can apply and test the patch on my network within three days, and I'll report back then. Concerning the headers, I'm going to build it, and a standalone version of GASNet as well. Thank you again! – Tiago Carneiro Dec 11 '18 at 12:31
  • Hello Paul. I'm trying to apply your changes to Chapel's source code. However, I cannot find your modifications in branch `develop` after cloning GASNet. What branch should I clone? Thank you. – Tiago Carneiro Dec 17 '18 at 20:08
  • Hello Tiago, the changes are in a pull-request (linked in my answer) and will not be accepted into develop until you have confirmed they are correct. Please try the following, starting in the unmodified Chapel source directory: `cd third-party/gasnet/gasnet-src && wget -q -O - https://bitbucket.org/PHHargrove/gasnet-public/commits/ibv-pkey/raw | patch -p1` – Paul H. Hargrove Dec 18 '18 at 04:35
  • Hello Paul, good news: it works! First, I modified the command to `cd .../gasnet-src/ibv-conduit`. Then, I had to export `GASNET_IBV_PKEY='0x8100'`, which is my PKey; otherwise, the connection returns `status=12`. Finally, it is necessary to manually set `MPIRUN_CMD="... --mca btl_openib_pkey "0x8100" ..."` when `GASNET_IBV_SPAWNER=mpi`. Thank you again for your patience and time! – Tiago Carneiro Dec 18 '18 at 15:52
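
Pulling the thread together, the working environment can be sketched roughly as below. This assumes the PKey patch from the answer has already been applied and Chapel rebuilt. `0x8100` is the asker's PKey, so substitute your own. `PKEY` and `PKEY_MCA_ARGS` are hypothetical helper variables introduced only for this illustration. The rest of `MPIRUN_CMD` is elided in the thread, so it is deliberately not reconstructed here.

```shell
# Settings taken from the question.
export CHPL_COMM_SUBSTRATE=ibv
export CHPL_LAUNCHER=gasnetrun_ibv
export GASNET_IBV_SPAWNER=mpi

# PKey of the InfiniBand partition (the asker's value; replace with yours).
PKEY=0x8100   # hypothetical helper variable

# Read by the patched ibv-conduit; without it the connection fails with status=12.
export GASNET_IBV_PKEY="$PKEY"

# Open MPI option that must also appear inside your existing MPIRUN_CMD
# when the spawner is mpi. The remainder of MPIRUN_CMD is site-specific.
PKEY_MCA_ARGS="--mca btl_openib_pkey $PKEY"   # hypothetical helper variable
echo "add to MPIRUN_CMD: $PKEY_MCA_ARGS"
```

Keeping the PKey in one variable avoids the two settings drifting apart, since both GASNet and the MPI spawner need the same value.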