
I'm trying to use MPI remote memory access (RMA) from inside C++ threads. This seems to work fine if the number of writes is small, but fails if I communicate often. The source below exhibits the following behavior:

(i) if compiled with #define USE_THREADS false, the program runs fine, irrespective of the value of MY_BOUND;

(ii) if compiled with #define USE_THREADS true, the program runs fine when executed with only two processes (mpiexec -n 2 ...), i.e., with only one communication thread, irrespective of the value of MY_BOUND;

(iii) if compiled with #define USE_THREADS true, the program runs fine when MY_BOUND is small, e.g., 10 or 100, but crashes when MY_BOUND is large, e.g., 100000.

The error I obtain in case (iii), i.e., for large values of MY_BOUND, is:

[...] *** Process received signal ***
[...] Signal: Segmentation fault: 11 (11)
[...] Signal code: Address not mapped (1)
[...] Failing at address: 0x0
[...] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node ... exited on signal 11 (Segmentation fault: 11).
--------------------------------------------------------------------------

Attaching a debugger to the processes reveals the following:

thread #4: tid = 0x5fa1e, 0x000000010e500de7 mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_pscw_peer + 87, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x000000010e500de7 mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_pscw_peer + 87
frame #1: 0x000000010e4fd2a2 mca_osc_pt2pt.so`osc_pt2pt_incoming_post + 50
frame #2: 0x000000010e4fabf1 mca_osc_pt2pt.so`ompi_osc_pt2pt_process_receive + 961
frame #3: 0x000000010e4f4957 mca_osc_pt2pt.so`component_progress + 279
frame #4: 0x00000001059f3de4 libopen-pal.20.dylib`opal_progress + 68
frame #5: 0x000000010e4fc9dd mca_osc_pt2pt.so`ompi_osc_pt2pt_complete + 765
frame #6: 0x0000000105649a00 libmpi.20.dylib`MPI_Win_complete + 160 

And when OpenMPI is compiled with --enable-debug I get:

* thread #4: tid = 0x93d0d, 0x0000000111521b1e mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_array_peer(rank=1, peers=0x0000000000000000, nranks=1, peer=0x0000000000000000) + 51 at osc_pt2pt_sync.c:61, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
   frame #0: 0x0000000111521b1e mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_array_peer(rank=1, peers=0x0000000000000000, nranks=1, peer=0x0000000000000000) + 51 at osc_pt2pt_sync.c:61
   58       int mid = nranks / 2;
   59   
   60       /* base cases */
-> 61       if (0 == nranks || (1 == nranks && peers[0]->rank != rank)) {
   62           if (peer) {
   63               *peer = NULL;
   64           }

Probably something simple is wrong in my implementation, but I'm having a hard time finding it, likely due to gaps in my understanding of MPI, C++, and threading. Any thoughts, comments, or feedback? Thanks, and happy new year.

My setup is GCC 4.9 with OpenMPI 2.0.1 (compiled with --enable-mpi-thread-multiple).
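
For orientation, here is the communication pattern in condensed form before the full source. The two helper functions are just my own framing (OriginEpochs and TargetEpochs do not exist in the actual program); the loop bodies are excerpted verbatim from the code further down.

// Condensed excerpt of the pattern used below: rank 0 runs one thread per
// worker rank, each driving Bound PSCW access epochs; every worker rank runs
// the matching exposure epochs. Function names here are mine, not the program's.
#include "mpi.h"

void OriginEpochs(MPI_Group destGroup, MPI_Win win, int* data, int Bound) {
  for (int m = 0; m < Bound; ++m) {
    MPI_Win_start(destGroup, 0, win);                  // open access epoch
    MPI_Put(data, 5, MPI_INT, 1, 0, 5, MPI_INT, win);  // write 5 ints to the worker
    MPI_Win_complete(win);                             // close access epoch
  }
}

void TargetEpochs(MPI_Group originGroup, MPI_Win win, int Bound) {
  for (int m = 0; m < Bound; ++m) {
    MPI_Win_post(originGroup, 0, win);                 // open exposure epoch
    MPI_Win_wait(win);                                 // wait for the matching complete
  }
}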

#include <iostream>
#include <vector>
#include <thread>
#include "mpi.h"

#define MY_BOUND 100000
#define USE_THREADS true

void FinalizeMPI();
void InitMPI(int, char**);
void NoThreads(int, int);

void ThreadFunction(int ThreadID, int Bound) {
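  // Communication thread: build a 2-rank communicator from world ranks {0, ThreadID+1},
  // create a zero-size window on the origin side, and target the worker (rank 1 in myComm).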
  std::cout << "test " << ThreadID << std::endl;
  MPI_Group GroupAll, myGroup, destGroup;
  MPI_Comm myComm;
  MPI_Win myWin;
  MPI_Comm_group(MPI_COMM_WORLD, &GroupAll);
  int ranks[2]{0, ThreadID+1};
  MPI_Group_incl(GroupAll, 2, ranks, &myGroup);
  MPI_Comm_create_group(MPI_COMM_WORLD, myGroup, 0, &myComm);
  MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, myComm, &myWin);
  int destrank = 1;
  MPI_Group_incl(myGroup, 1, &destrank, &destGroup);

  std::cout << "Objects created" << std::endl;

  std::vector<int> data(5, 1);
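  // Bound PSCW access epochs against the worker: start / put / complete.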
  for(int m=0;m<Bound;++m) {
    MPI_Win_start(destGroup, 0, myWin);
    MPI_Put(&data[0], 5, MPI_INT, 1, 0, 5, MPI_INT, myWin);
    MPI_Win_complete(myWin);
  }

  MPI_Group_free(&destGroup);
  MPI_Win_free(&myWin);
  MPI_Comm_free(&myComm);
  MPI_Group_free(&myGroup);
  MPI_Group_free(&GroupAll);
}

void WithThreads(int comm_size, int Bound) {
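  // Rank 0 with USE_THREADS true: one communication thread per worker rank, joined at the end.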
  std::vector<std::thread*> Threads(comm_size-1, NULL);
  for(int k=0;k<comm_size-1;++k) {
    Threads[k] = new std::thread(ThreadFunction, k, Bound);
  }
  for(int k=0;k<comm_size-1;++k) {
    Threads[k]->join();
  }
  std::cout << "done" << std::endl;
}

int main(int argc, char** argv) {
    InitMPI(argc, argv);

    int rank,comm_size;
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&comm_size);

    if(comm_size<2) {
        FinalizeMPI();
        return 0;
    }

    int Bound = MY_BOUND;

    if(rank==0) {
      if(USE_THREADS) {
        WithThreads(comm_size, Bound);
      } else {
        NoThreads(comm_size, Bound);
      }
    } else {
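        // Worker rank: expose a 5-int window (tmp) to rank 0 and run the matching post/wait exposure epochs.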
        MPI_Group GroupAll;
        MPI_Comm_group(MPI_COMM_WORLD, &GroupAll);
        std::vector<int> tmp(5,0);
        MPI_Group myGroup, destinationGroup;
        MPI_Comm myComm;
        MPI_Win myWin;
        int ranks[2]{0,rank};
        MPI_Group_incl(GroupAll, 2, ranks, &myGroup);
        MPI_Comm_create_group(MPI_COMM_WORLD, myGroup, 0, &myComm);
        MPI_Win_create(&tmp[0], 5*sizeof(int), sizeof(int), MPI_INFO_NULL, myComm, &myWin);

        int destrank = 0;
        MPI_Group_incl(myGroup, 1, &destrank, &destinationGroup);

        for(int m=0;m<Bound;++m) {
          MPI_Win_post(destinationGroup, 0, myWin);
          MPI_Win_wait(myWin);
        }

        std::cout << " Rank " << rank << ":";
        for(auto& e : tmp) {
          std::cout << " " << e;
        }
        std::cout << std::endl;

        MPI_Win_free(&myWin);
        MPI_Comm_free(&myComm);
        MPI_Group_free(&myGroup);
        MPI_Group_free(&GroupAll);
    }

    FinalizeMPI();
    return 0;
}

void FinalizeMPI() {
   int flag;
   MPI_Finalized(&flag);
   if(!flag)
      MPI_Finalize();
}

void InitMPI(int argc, char** argv) {
   int flag;
   MPI_Initialized(&flag);
   if(!flag) {
      int provided_Support;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided_Support);
      if(provided_Support!=MPI_THREAD_MULTIPLE) {
          exit(0);
      }
   }
}

void NoThreads(int comm_size, int Bound) {
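  // Sequential baseline (USE_THREADS false): same pair communicators, windows, and PSCW epochs,
  // all driven from rank 0's main thread.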
  MPI_Group GroupAll;
  MPI_Comm_group(MPI_COMM_WORLD, &GroupAll);
  std::vector<MPI_Group> myGroups(comm_size-1);
  std::vector<MPI_Comm> myComms(comm_size-1);
  std::vector<MPI_Win> myWins(comm_size-1);
  std::vector<MPI_Group> destGroups(comm_size-1);
  for(int k=1;k<comm_size;++k) {
     int ranks[2]{0, k};
     MPI_Group_incl(GroupAll, 2, ranks, &myGroups[k-1]);
     MPI_Comm_create_group(MPI_COMM_WORLD, myGroups[k-1], 0, &myComms[k-1]);
     MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, myComms[k-1], &myWins[k-1]);
     int destrank = 1;
     MPI_Group_incl(myGroups[k-1], 1, &destrank, &destGroups[k-1]);
  }

  std::vector<int> data(5, 1);

  for(int k=0;k<comm_size-1;++k) {
    for(int m=0;m<Bound;++m) {
      MPI_Win_start(destGroups[k], 0, myWins[k]);
      MPI_Put(&data[0], 5, MPI_INT, 1, 0, 5, MPI_INT, myWins[k]);
      MPI_Win_complete(myWins[k]);
    }
  }

  for(int k=0;k<comm_size-1;++k) {
    MPI_Win_free(&myWins[k]);
    MPI_Comm_free(&myComms[k]);
    MPI_Group_free(&myGroups[k]);
  }
  MPI_Group_free(&GroupAll);
}

UPDATE: While debugging I observed that the peers pointer does not seem to be NULL in the calling stack frame. Maybe another thread updated the variable after the debugged thread crashed? Could that indicate a locking issue? (A serialization experiment I am considering is sketched after the debugger output below.)

* thread #4: tid = 0xa7935, 0x000000010fa67b1e mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_array_peer(rank=1, peers=0x0000000000000000, nranks=1, peer=0x0000000000000000) + 51 at osc_pt2pt_sync.c:61, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000010fa67b1e mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_array_peer(rank=1, peers=0x0000000000000000, nranks=1, peer=0x0000000000000000) + 51 at osc_pt2pt_sync.c:61
   58       int mid = nranks / 2;
   59   
   60       /* base cases */
-> 61       if (0 == nranks || (1 == nranks && peers[0]->rank != rank)) {
   62           if (peer) {
   63               *peer = NULL;
   64           }
(lldb) frame select 1
frame #1: 0x000000010fa67c4c mca_osc_pt2pt.so`ompi_osc_pt2pt_sync_pscw_peer(module=0x00007fc0ff006a00, target=1, peer=0x0000000000000000) + 105 at osc_pt2pt_sync.c:92
   89           return false;
   90       }
   91   
-> 92       return ompi_osc_pt2pt_sync_array_peer (target, pt2pt_sync->peer_list.peers, pt2pt_sync->num_peers, peer);
   93   }
(lldb) p (pt2pt_sync->peer_list)
(<anonymous union>) $1 = {
  peers = 0x00007fc0fb534cb0
  peer = 0x00007fc0fb534cb0
}
(lldb) p pt2pt_sync->peer_list.peers
(ompi_osc_pt2pt_peer_t **) $2 = 0x00007fc0fb534cb0
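
To probe the locking question, one experiment I am considering (not tried yet) is to serialize all RMA epoch calls from the worker threads behind a single process-wide mutex and check whether the crash disappears. A minimal sketch of what I mean; g_rma_mutex and SerializedEpoch are my own names and are not part of the program above.

#include <mutex>
#include "mpi.h"

std::mutex g_rma_mutex;  // guards every RMA epoch issued from the worker threads

// Drop-in replacement for the loop body in ThreadFunction: only one thread at a
// time is inside MPI_Win_start / MPI_Put / MPI_Win_complete, so the osc/pt2pt
// code is never entered concurrently.
void SerializedEpoch(MPI_Group destGroup, MPI_Win win, int* data) {
  std::lock_guard<std::mutex> lock(g_rma_mutex);
  MPI_Win_start(destGroup, 0, win);
  MPI_Put(data, 5, MPI_INT, 1, 0, 5, MPI_INT, win);
  MPI_Win_complete(win);
}
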
  • Could that be a per-thread stack size issue? – Gilles Jan 01 '17 at 06:26
  • Thanks @Gilles, any thoughts how I could verify your hint or any ideas about what I should look at inside the MPI library to check the stack sizes? Any particular variables inside MPI that I should monitor? – user3884566 Jan 01 '17 at 17:35
  • This is totally OS / compiler dependent and has nothing to do with MPI AFAICT. A bit of search on SO gave me [that link](http://stackoverflow.com/q/13871763/5239503) which tells so. Now, were I on Linux, I would try to add `-fuse-ld=gold -fsplit-stack` to the link options, just to see (without any certainty that would work). – Gilles Jan 02 '17 at 06:14
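
UPDATE 2: Regarding the per-thread stack size hint from the comments: std::thread does not let me set the stack size, so the only way I see to test it would be to spawn the communication workers through pthreads with a larger stack. Untested sketch; LaunchArgs, RunWorker, and LaunchWithBigStack are my own names.

#include <pthread.h>

// Run the existing ThreadFunction on a pthread with a 16 MiB stack instead of
// the platform default, to test the stack-size hypothesis.
struct LaunchArgs { int thread_id; int bound; };

void ThreadFunction(int ThreadID, int Bound);  // from the program above

void* RunWorker(void* p) {
  LaunchArgs* a = static_cast<LaunchArgs*>(p);
  ThreadFunction(a->thread_id, a->bound);
  return nullptr;
}

bool LaunchWithBigStack(pthread_t* t, LaunchArgs* args) {
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setstacksize(&attr, 16 * 1024 * 1024);  // 16 MiB per thread
  int rc = pthread_create(t, &attr, RunWorker, args);
  pthread_attr_destroy(&attr);
  return rc == 0;
}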
