
I am trying to use MVAPICH2-GDR for a simple hello world program. Although the code compiles successfully, it crashes with a segmentation fault at runtime. My platform runs Red Hat 6.5 with CUDA 7.5, so I downloaded the RPM file mvapich2-gdr-cuda7.5-intel-2.2-0.3.rc1.el6.x86_64.rpm.

The MPI code is the simple hello world program:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Print off a hello world message
    printf("Hello world from %d out of %d\n", world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}

To compile the program, I used the following command:

mpicc hello.c -o hello

To run the program:

mpirun -np 2 ./hello

The error message is as follows:

[localhost.localdomain:mpi_rank_1][error_sighandler] Caught error: Segmentation fault (signal 11)
[localhost.localdomain:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)                
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 188057 RUNNING AT localhost.localdomain
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES 

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)

Because MVAPICH2-GDR is not open source, I really don't know where the error comes from. Has anyone successfully used MVAPICH2-GDR?

silence_lamb
  • You have several calls to MPI functions. They usually return some kind of status; when they do, check them instead of blindly hoping they worked. It would also be a cinch to insert a few `printf` statements (ensuring they end with a newline to flush the output) to tell which of those calls returns at all. And have you tried `MPI_Init(&argc, &argv);`? – Weather Vane Jul 25 '16 at 20:57
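
For reference, here is a minimal sketch of the kind of checking the comment above suggests. It is not from the original post: the check() helper is hypothetical, and note that MPI's default error handler (MPI_ERRORS_ARE_FATAL) may abort before a non-MPI_SUCCESS code is ever returned, so the printf markers are what would actually localize a crash inside MPI_Init.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: report and exit if an MPI call did not return MPI_SUCCESS.
static void check(int rc, const char *what) {
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "%s failed with error code %d\n", what, rc);
        exit(EXIT_FAILURE);
    }
}

int main(int argc, char **argv) {
    printf("before MPI_Init\n");               // trailing newline flushes line-buffered stdout
    check(MPI_Init(&argc, &argv), "MPI_Init"); // pass argc/argv as the comment suggests
    printf("after MPI_Init\n");

    int world_size, world_rank;
    check(MPI_Comm_size(MPI_COMM_WORLD, &world_size), "MPI_Comm_size");
    check(MPI_Comm_rank(MPI_COMM_WORLD, &world_rank), "MPI_Comm_rank");

    printf("Hello world from %d out of %d\n", world_rank, world_size);

    check(MPI_Finalize(), "MPI_Finalize");
    return 0;
}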

1 Answer


To boost the performance of GPU-GPU communication, MVAPICH2-GDR uses a new GDRCOPY module. You need to either explicitly point MVAPICH2-GDR to the GDRCOPY library or explicitly disable the feature by setting MV2_USE_GPUDIRECT_GDRCOPY=0.

As you can see below, by disabling this feature I am able to run your code. For more information, please refer to the user guide: http://mvapich.cse.ohio-state.edu/userguide/gdr/2.2rc1/

[hamidouc@ivy1 mvapich2]$ export MV2_USE_GPUDIRECT_GDRCOPY=0
[hamidouc@ivy1 mvapich2]$ ./install/bin/mpirun -np 2 ./a.out
Hello world from 0 out of 2
Hello world from 1 out of 2
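
If you would rather keep GDRCOPY enabled, the linked user guide also describes a runtime parameter for pointing MVAPICH2-GDR at the GDRCOPY library. A rough sketch is below; the variable name (MV2_GPUDIRECT_GDRCOPY_LIB) is quoted from memory and the library path is a made-up example, so please verify both against the user guide before relying on them.

# Assumed variable name and example path -- check the user guide above
export MV2_GPUDIRECT_GDRCOPY_LIB=/opt/gdrcopy/lib64/libgdrcopy.so
./install/bin/mpirun -np 2 ./a.out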

  • I did follow all the steps in that user guide. Even though I set MV2_USE_GPUDIRECT_GDRCOPY=0, it still fails with a segmentation fault. I checked the status of each MPI call; the error happens in the first statement, MPI_Init(NULL, NULL);. The MVAPICH2-GDR package OSU provides targets MLNX-OFED 3.2, but I installed the latest driver version, MLNX-OFED 3.3. I'm not sure whether that caused the problem, but 3.2 is no longer available on the Mellanox website. – silence_lamb Jul 26 '16 at 20:34