
My application performs all-pairs one-sided communication (every process actively communicates with every other process).

I am observing a performance bottleneck in network bandwidth, and I am considering moving parts of the communication to collective calls if that can reduce bandwidth usage.

What if I use MPI collectives instead of one-sided communication calls? Can that reduce the total network bandwidth utilization? It will depend on the MPI implementation (I am using Intel MPI over Mellanox InfiniBand).

If InfiniBand's RDMA supports bandwidth-efficient broadcast or multicast functionality, MPI will benefit from that directly.

The following is part of my current usage of one-sided communication, which could be changed into MPI_Bcast by defining sub-groups.

In each process:

    /* read from the k ranks ahead of me; win, count and req are set up elsewhere */
    for (int i = 1; i <= k; i++)
        MPI_Rget(buf[i], count, MPI_BYTE, (my_rank + i) % nprocs,
                 0, count, MPI_BYTE, win, &req[i]);
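For reference, here is a minimal, self-contained sketch of the MPI_Bcast variant I am considering: one sub-communicator per source rank, containing that source and the k ranks that read from it. K, COUNT, MPI_DOUBLE and the group layout are illustrative assumptions (and assume K < nprocs); in real code the sub-communicators would be created once, outside the hot loop:

    #include <mpi.h>

    #define K     4        /* illustrative: how many peers each rank reads from */
    #define COUNT 1024     /* illustrative message size (doubles) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Group world_group;
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        double sendbuf[COUNT];       /* my data, read by ranks rank-1 .. rank-K */
        double buf[K + 1][COUNT];    /* buf[i] receives the data of rank+i */

        for (int r = 0; r < nprocs; r++) {
            /* group for source r: {r, r-1, ..., r-K} (mod nprocs); the source
               is listed first so it becomes rank 0, i.e. the broadcast root */
            int members[K + 1];
            for (int j = 0; j <= K; j++)
                members[j] = ((r - j) % nprocs + nprocs) % nprocs;

            MPI_Group g;
            MPI_Group_incl(world_group, K + 1, members, &g);
            MPI_Comm subcomm;
            MPI_Comm_create(MPI_COMM_WORLD, g, &subcomm);   /* collective */

            if (subcomm != MPI_COMM_NULL) {   /* I am the source or a reader of r */
                int j;                        /* my offset: I read from rank+j */
                MPI_Comm_rank(subcomm, &j);
                MPI_Bcast(j == 0 ? sendbuf : buf[j], COUNT, MPI_DOUBLE, 0, subcomm);
                MPI_Comm_free(&subcomm);
            }
            MPI_Group_free(&g);
        }
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }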

Thanks

syko
  • Collectives don't reduce bandwidth usage, but they can reduce latency costs with e.g. recursive doubling. How big are your messages? Is your pattern alltoall (regular) or alltoallv (irregular)? Do you know the message counts everywhere or not? Is this an FFT, transpose, or sorting application? I can give a much better answer with more detail. – Jeff Hammond Feb 14 '16 at 02:55
  • @Jeff Thanks, Jeff. [1] Could you explain why it does not reduce bandwidth usage? (From Google I can find some docs like this: http://www.mcs.anl.gov/events/workshops/p2s2/2015/slides/P2S2-2015-huanzhou.pdf) [2] Some portion of a big array is assigned exclusively to each process. Each process has a small buffer and repeats receive & compute on it (it keeps sliding over all the elements). This is not like an FFT, which performs intensive shuffling. – syko Feb 14 '16 at 03:15
  • How did you determine the performance bottleneck? Do you have actual evidence, or is it just a guess? Make sure to use a [proper performance analysis tool](http://stackoverflow.com/q/18191635/620382) first, before trying to optimize based on a guess. – Zulan Feb 14 '16 at 06:56
  • @Zulan I made a measurement using Intel MPI Trace Analyzer and also analyzed the total bandwidth my application used during run-time. It's over 12 GB/s, which is close to the limit of our cluster's network bandwidth. – syko Feb 14 '16 at 08:00
  • How many nodes does the application run on? What's the network topology? Does the 12 GB/s apply to a specific link? An end-to-end pair of peers? I don't think you can get a helpful generic answer (technically Jeff already gave it). We can help you better if you provide specific details, especially results from the performance analysis so far. – Zulan Feb 14 '16 at 09:04
  • @syko I meant that collectives do not reduce bandwidth relative to an appropriately written point-to-point implementation. Obviously, if you compare an O(n) broadcast to an O(log n) one, there is a difference, but you can write O(log n) in point-to-point (a sketch follows these comments). Anyway, I am still waiting to learn the specifics of your usage. – Jeff Hammond Mar 22 '16 at 13:32
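To make Jeff Hammond's last point concrete, below is a minimal sketch of an O(log n) binomial-tree broadcast written only with blocking point-to-point calls; my_bcast and the tree layout are illustrative, not Intel MPI's actual implementation:

    #include <mpi.h>

    /* broadcast buf from root to all ranks of comm in O(log n) rounds,
       using only point-to-point calls (binomial tree) */
    void my_bcast(void *buf, int count, MPI_Datatype dtype,
                  int root, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int rel = (rank - root + size) % size;   /* my rank relative to root */

        for (int mask = 1; mask < size; mask <<= 1) {
            if (rel < mask) {                    /* I already have the data */
                int dst = rel + mask;            /* forward it down the tree */
                if (dst < size)
                    MPI_Send(buf, count, dtype, (dst + root) % size, 0, comm);
            } else if (rel < 2 * mask) {         /* my turn to receive */
                MPI_Recv(buf, count, dtype, (rel - mask + root) % size, 0,
                         comm, MPI_STATUS_IGNORE);
            }
        }
    }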

0 Answers