
I'm trying to run Chapel/GASNet on a cluster equipped with an Omni-Path network.

The official GASNet documentation for Omni-Path recommends using the ofi-conduit by passing --enable-ofi --disable-psm --disable-ibv. However, as I do not know where to pass these options, I decided to use the PSM conduit for Omni-Path.

1) I can run Chapel/GASNet using GASNET_PSM_SPAWNER='ssh'. However, with this spawner PGAS performance is quite slow.

2) I can only use MPI as the spawner if I set -mca mtl ^psm,psm2, which is also slow. Otherwise, I receive several errors.

3) I tried to use PMI as the spawner. However, I receive the following error message: Spawner is set to PMI, but PMI support was not compiled in usage: gasnetrun...

How can I compile PMI support so that I can use GASNET_PSM_SPAWNER='pmi'?

Here are my other Chapel/GASNet runtime variables:

CHPL_COMM='gasnet'

CHPL_LAUNCHER='gasnetrun_psm'

CHPL_COMM_SUBSTRATE='psm'

CHPL_GASNET_SEGMENT='everything'

CHPL_TARGET_ARCH='native'

HFI_NO_CPUAFFINITY=1

All the best,

Tiago Carneiro.

  • What is "quite slow PGAS"? What makes you think that has anything to do with the job spawner? It sounds like the question you actually want to ask is why is your program running slowly, but you have not provided any code or explanation of what your program is doing.. – Dan Bonachea Jan 09 '19 at 18:58
  • Hello Dan. I think I expressed myself poorly. Using the Omni-Path substrate is around `2x` slower than InfiniBand. Moreover, it is strange that using either `SSH` or setting `MPI_RUN ... -mca mtl ^psm,psm2`, which runs on Ethernet, results in the same execution time. Maybe this is a network problem, I don't know. According to the GASNet documentation, PMI may be faster. That's why I would like to use it. However, Brad told me that Chapel cannot use PMI. Best regards. – Tiago Carneiro Jan 11 '19 at 02:35
  • Tiago - As Brad explained, the choice of *spawner* should NOT affect steady-state performance. The PMI/SSH/MPI spawners just get your processes up and running and connected, so anything that works without errors should be acceptable. Differences in spawner performance are only visible in job startup time at giant scales (i.e., thousands of ranks). However, the choice of GASNet *conduit* (`CHPL_COMM_SUBSTRATE=ibv,psm,ofi`) makes a big difference to steady-state communication performance. Other settings like `CHPL_GASNET_SEGMENT` and the Chapel `--fast` flag may also impact steady-state performance. – Dan Bonachea Jan 12 '19 at 06:06

1 Answer


I don't have easy access to an Omni-path system to test any of this, but in the interest of trying to get you an answer:

It appears to me as though Chapel ought to build and use the ofi-conduit if you do the following:

  • set CHPL_COMM_SUBSTRATE=ofi in your environment (e.g., export CHPL_COMM_SUBSTRATE=ofi)
  • re-build Chapel (e.g., make or gmake from $CHPL_HOME)
  • re-compile and re-run your program
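The first two steps above might look roughly like the following sketch. It assumes that `CHPL_HOME` points at a Chapel source tree and that the other settings from the question stay as they are; the guard simply skips the rebuild when `CHPL_HOME` is not set. I have not been able to test this on an Omni-Path system.

```shell
# Sketch only: assumes CHPL_HOME points at a Chapel source tree.
export CHPL_COMM=gasnet
export CHPL_COMM_SUBSTRATE=ofi   # select GASNet's ofi-conduit instead of psm

# Note: CHPL_LAUNCHER='gasnetrun_psm' from the question is tied to the psm
# substrate and would likely need revisiting (assumption, not verified).

if [ -n "${CHPL_HOME:-}" ] && [ -d "$CHPL_HOME" ]; then
    # Rebuild the Chapel runtime for the new substrate.
    make -C "$CHPL_HOME"
    # Print the resulting configuration to confirm the substrate took effect.
    "$CHPL_HOME/util/printchplenv"
else
    echo "CHPL_HOME is not set; skipping the rebuild" >&2
fi
```

After the rebuild, the program itself has to be recompiled so that it links against the newly built runtime.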

As far as I am aware, the choice of spawner/launcher should not have an impact on your program's performance. It is simply the mechanism for getting the executables up and running on the system's compute nodes. That is, if you have a technique that is working, I'd suggest sticking with it rather than trying to use other spawners/launchers. (In any case, I'm not personally familiar with how to use the PMI spawner and am fairly certain that Chapel doesn't currently have a launcher that wraps it.)

By contrast, the choice of conduit can have a very large impact on program performance, as it governs how communication takes place throughout the program's execution.

As a reminder: As with any Chapel program, once you have it working correctly and are doing performance studies, be sure to use the --fast flag.
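As a rough illustration of a performance-oriented compile and run, here is a minimal sketch. The program name `myprog` and the locale count of 2 are hypothetical placeholders, and the guard skips everything when `chpl` or the source file is not available.

```shell
# Hypothetical program name and locale count, for illustration only.
PROGRAM=myprog
NUM_LOCALES=2

if command -v chpl >/dev/null 2>&1 && [ -f "$PROGRAM.chpl" ]; then
    # --fast enables optimizations and disables runtime checks.
    chpl --fast -o "$PROGRAM" "$PROGRAM.chpl"
    # -nl sets the number of locales (compute nodes) to run on.
    "./$PROGRAM" -nl "$NUM_LOCALES"
else
    echo "chpl or $PROGRAM.chpl not found; skipping" >&2
fi
```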

Brad
  • Hello Brad, thank you for your answer. I tried to use OFI, but I receive the following error: `configure error: User requested --enable-ofi but I don't know how to build ofi programs for your system`. I downloaded GASNet 1.32 and I face the same error when trying to build OFI by hand. I downloaded libfabric and set OFI_HOME, but Chapel returns the same error. I observed that the error lies here: `checking for OFI_LDFLAGS setting... -L/home/user/libfabric-1.7.0/lib`, as there is no lib folder, even after running `./configure` and `make`. – Tiago Carneiro Jan 08 '19 at 03:33
  • Hello Brad. Thanks for the explanation concerning the spawner. For some reason, using Chapel + Omni-Path results in quite slow PGAS performance, just a little faster than MPI + `-mca mtl ^psm,psm2` and much slower than IBV+MPI. So, I tried to use another spawner, as it was the last parameter I had left to tune. As my code works perfectly on IBV and Omni-Path + SSH, I'll continue with Omni-Path + SSH. I'm using `--fast`. Thank you again! – Tiago Carneiro Jan 08 '19 at 03:39
  • Hi Tiago, a couple of other notes: (1) I'm not familiar enough with GASNet's configure step to know what might be going wrong when you try to build OFI, but I think that's worth contacting the GASNet team to ask about (see http://gasnet.lbl.gov/#contact and/or search their bugzilla database). (2) You may find that setting CHPL_GASNET_SEGMENT to 'fast' helps with your performance. (3) It might be worth verifying that you're running on multiple nodes, for example by running a Chapel program with `--verbose` or by running `$CHPL_HOME/examples/hello6-taskpar-dist.chpl`. – Brad Jan 09 '19 at 18:02
  • Also, I'm curious: When you say "poor PGAS performance" how are you evaluating this? – Brad Jan 09 '19 at 18:02
  • Hello Brad. Thank you again. (1) I'll try contacting the GASNet team and Paul Hargrove. (2) OK, it was set to `CHPL_GASNET_SEGMENT='everything'`. I'm going to change it. (3) The application I'm running performs an initial task and then distributes the data structure. Then, I use a forall with distributed iterators to evaluate the distributed data. Many thanks! – Tiago Carneiro Jan 11 '19 at 02:43