The Setup

I'm using mpi4py to reduce a numpy array element-wise across multiple processes. The idea is that the numpy arrays get summed element-wise, so that if I have two processes, each with the array:

Rank 0: [1, 1, 1]
Rank 1: [2, 3, 4]

after reduction I should have

[3, 4, 5]

This case, with such short arrays, works fine.

The Problem

However, in my real use-case these arrays are quite long (array_length in my example code below). I have no problems if I send numpy arrays of length less than or equal to 505 elements, but above that, I get the following output:

[83621b291fb8:01112] Read -1, expected 4048, errno = 1

and I've been unable to find any documented reason why this might be. Interestingly, however, 506 * 8 = 4048, which (allowing for some header data) makes me suspect I'm hitting a 4 KB buffer limit somewhere inside mpi4py or MPI itself.

One Possible Work-Around

I've managed to work around this problem by breaking the numpy array I want to reduce element-wise into chunks of 200 elements (just an arbitrary number less than 505), calling Reduce() on each chunk, and then reassembling the result on the root process. However, this is somewhat slow.

My Questions:

  1. Does anyone know if this is indeed due to a 4kb buffer limit (or similar) in mpi4py/MPI?

  2. Is there a better solution than slicing the array into pieces and making many calls to Reduce(), as I am currently doing? This seems rather slow.


Some Examples

Below is code that illustrates

  1. the problem, and
  2. one possible solution, based on slicing the array into shorter pieces and doing lots of MPI Reduce() calls, rather than just one (controlled with the use_slices boolean)

With case=0 and use_slices=False, the error can be seen (array length 506)

With case=1 and use_slices=False, the error vanishes (array length 505)

With use_slices=True, the error vanishes, regardless of case, and even if case is set to a very long array (case=2)


Example Code

import mpi4py, mpi4py.MPI
import numpy as np

###### CASE FLAGS ########
# Whether or not to break the array into 200-element pieces
# before calling MPI Reduce()
use_slices = False

# The total length of the array to be reduced:
case = 0
if case == 0:
    array_length = 506
elif case == 1:
    array_length = 505
elif case == 2:
    array_length = 1000000

comm = mpi4py.MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()


array_to_reduce = np.ones(array_length)*(rank+1)  #just some different numbers per rank
reduced_array = np.zeros(array_length)

if not use_slices:
    comm.Reduce(array_to_reduce,
                reduced_array,
                op = mpi4py.MPI.SUM,
                root = 0)

    if rank==0:
        print(reduced_array)
else:  # in this case, use_slices is True
    array_slice_length = 200
    # Split the array into chunks of at most array_slice_length elements
    sliced_array = np.array_split(array_to_reduce,
                                  range(array_slice_length, array_length, array_slice_length))

    reduced_array_using_slices = np.array([])
    for array_slice in sliced_array:
        # Reduce() only delivers the summed result on the root rank;
        # the other ranks keep the zeros allocated here.
        returnedval = np.zeros(shape=array_slice.shape)
        comm.Reduce(array_slice,
                    returnedval,
                    op = mpi4py.MPI.SUM,
                    root = 0)
        reduced_array_using_slices=np.concatenate((reduced_array_using_slices, returnedval))
        comm.Barrier()

    if rank==0:
        print(reduced_array_using_slices)
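
For completeness: I launch this with a plain mpirun on two ranks, along the lines of the following (the script filename here is just a placeholder):

mpirun -n 2 python reduce_test.py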

Library Versions

Compiled from source: Open MPI 3.1.4, mpi4py 3.0.3.
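
For reference, here is a quick sketch of how to double-check which MPI library mpi4py is actually linked against (useful if several MPI installations might be getting mixed up, as one of the comments below suggests checking). It only uses mpi4py's own reporting functions:

import mpi4py
import mpi4py.MPI as MPI

# Version of the mpi4py package itself
print("mpi4py:", mpi4py.__version__)
# Version string reported by the underlying MPI library
# (should mention Open MPI 3.1.4 in my case)
print(MPI.Get_library_version())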

  • It helps if you provide the version of the MPI library and runtime that mpi4py is built with. – Hristo Iliev May 17 '20 at 21:09
  • openmpi 3.1.4, mpi4py 3.0.3. Question edited to reflect this. – carthurs May 17 '20 at 22:46
  • Try `mpirun --mca coll ^tuned ...` and see if it helps – Gilles Gouaillardet May 18 '20 at 00:36
  • Are you running the two ranks on the same node or on two separate nodes? If running on separate nodes, what kind of communication network do you have? – Hristo Iliev May 18 '20 at 07:27
  • @GillesGouaillardet - thanks, unfortunately this didn't change anything. – carthurs May 18 '20 at 09:55
  • @HristoIliev - I'm on a single node. I'm running in a Linux Docker container on a Linux desktop host with a single physical CPU (four cores). I don't think it's a Docker problem though - similar errors have been plaguing me on various setups for quite some time, including native Linux, and Linux in a Virtualbox VM. Ubuntu 12.04, 16.04, openSuse 15.2. I get a similar error from `h5py`'s `create_dataset()` (compiled against `MPI`), which I'm trying to work round now. Could I have miscompiled openMPI? - although I _suspect_ I've tried the OS repo openMPI on at least some of these hosts... – carthurs May 18 '20 at 10:05
  • @HristoIliev - in case it helps, I can avoid the `h5py` `create_dataset()` error message by not specifying `maxshape=` in that call. This is very similar to this previous (unresolved) problem I had on another system: https://stackoverflow.com/questions/56131655/hdf5-h5py-segfault-with-mpi-with-specific-numbers-of-mpi-threads - I'd be surprised if this wasn't related to the `Reduce()` issue I'm seeing now. – carthurs May 18 '20 at 10:25
  • So you are invoking `mpirun` inside your container, and your 2 MPI tasks will hence run inside the very same container? – Gilles Gouaillardet May 18 '20 at 10:27
  • What happens if you do not run in a Docker container and with Open MPI supplied by the distro? Check your paths and library directories to make sure you are not mixing up different versions. – Hristo Iliev May 18 '20 at 10:30
  • @GillesGouaillardet yes exactly - for example, I see the error even if I open the container interactively and type `mpirun`... into the container's shell. – carthurs May 18 '20 at 10:32
  • the errno is `EPERM` and to me that suggests a container related issue. – Gilles Gouaillardet May 18 '20 at 10:38
  • how many interfaces in your container? you can restrict to a single one with (for example) `mpirun --mca btl_tcp_if_include eth0 ...` – Gilles Gouaillardet May 18 '20 at 10:38
  • `case = 2, use_slices = False` runs perfectly fine for me on Ubuntu 16.04 with Open MPI 2.1.1 from the distro. – Hristo Iliev May 18 '20 at 10:44
  • @GillesGouaillardet, aren't single-node runs using the shared memory BTL? – Hristo Iliev May 18 '20 at 10:46
  • @GillesGouaillardet - just the one interface -`eth0` - (plus the loopback). Using `btl_tcp_if_include eth0` gives the same error. – carthurs May 18 '20 at 10:50
  • @HristoIliev - yes I checked, and I'm not getting the error natively, either (openMPI 1.10.7). However, I'm very suspicious that there might be some UB here. - I'll keep exploring - thanks for your advice, both (+@GillesGouaillardet). – carthurs May 18 '20 at 10:55
  • What if you `mpirun -mca btl vader,self...` or `mpirun -mca btl sm,self` or `mpirun -mca tcp,self`? – Gilles Gouaillardet May 18 '20 at 11:59
  • @GillesGouaillardet - in order: 1) same effect; 2) error "As of version 3.0.0, the "sm" BTL is no longer available in Open MPI."; 3) "unable to find executable" - but I think you meant `mpirun -mca btl tcp,self` (with the `btl`). This does not display the error. Any thoughts on why? I'm still wondering if this is UB, as opposed to a fix - because I rebuilt my container with openMPI 1.10.7 (from source - as this version apparently worked natively on my system), and it segfaults in place of the original error message, whether or not I use tcp. There's no other versions of MPI in the container. – carthurs May 18 '20 at 12:55
  • Maybe because you are in a container :-) what about `mpirun -mca btl_vader_single_copy_mechanism none ...`? – Gilles Gouaillardet May 18 '20 at 13:02
  • This also works - again in the original container with openMPI 3.1.4. What do you think is happening? Something to do with how the container interacts with MPI? – carthurs May 18 '20 at 13:18
  • by default, `btl/vader` uses a single copy mechanism based on `process_vm_readv()` and `process_vm_writev()`, and those system calls are not allowed inside a default docker container. – Gilles Gouaillardet May 18 '20 at 14:05
  • The [default `seccomp` configuration](https://github.com/moby/moby/blob/327a0b4ae430fd670ac84a5cbdb3b9fc035a3b88/profiles/seccomp/default.json#L726) for Docker seems to limit the CMA calls used by Open MPI to containers with `CAP_SYS_PTRACE` capability. Try starting the container with `--add-cap=SYS_PTRACE` and then try without setting `btl_vader_single_copy_mechanism`. – Hristo Iliev May 18 '20 at 21:31
  • @HristoIliev I tested your suggestion, and it works too: `docker run --cap-add=SYS_PTRACE`... without setting `btl_vader_single_copy_mechanism` to `none`. (Note the syntax is `--cap-add` rather than `--add-cap`). I didn't observe any performance difference between the two solutions, in my case. – carthurs May 19 '20 at 12:12
  • I'll summarise the findings in an answer, in case someone else runs Open MPI in docker containers and encounters the same issue. – Hristo Iliev May 19 '20 at 12:49

1 Answer

This is not a problem with mpi4py per se. The issue comes from the Cross-Memory Attach (CMA) system calls process_vm_readv() and process_vm_writev() that the shared-memory BTLs (Byte Transfer Layers, a.k.a. the things that move bytes between ranks) of Open MPI use to accelerate shared-memory communication between ranks that run on the same node, by avoiding copying the data twice to and from a shared-memory buffer. This mechanism involves some setup overhead and is therefore only used for larger messages, which is why the problem only starts occurring after the message size crosses the eager threshold.

CMA is part of the ptrace family of kernel services. Docker uses seccomp to limit what system calls can be made by processes running inside the container. The default profile has the following:

    {
        "names": [
            "kcmp",
            "process_vm_readv",
            "process_vm_writev",
            "ptrace"
        ],
        "action": "SCMP_ACT_ALLOW",
        "args": [],
        "comment": "",
        "includes": {
            "caps": [
                "CAP_SYS_PTRACE"
            ]
        },
        "excludes": {}
    },

This limits the ptrace-related syscalls to containers that have the CAP_SYS_PTRACE capability, which is not among the capabilities granted by default. Therefore, to enable the normal functioning of Open MPI in Docker, one needs to grant the required capability by calling docker run with the following additional option:

--cap-add=SYS_PTRACE

This will allow Open MPI to function properly, but enabling ptrace may present a security risk in certain container deployments. Therefore, an alternative is to disable the use of CMA by Open MPI. This is achieved by setting an MCA parameter depending on the version of Open MPI and the shared-memory BTL used:

  • for the sm BTL (default before Open MPI 1.8): --mca btl_sm_use_cma 0
  • for the vader BTL (default since Open MPI 1.8): --mca btl_vader_single_copy_mechanism none

Disabling the single-copy mechanism will force the BTL to use pipelined copy through a shared-memory buffer, which may or may not affect the run time of the MPI job.
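
As an aside (a sketch, not the only way to do it): MCA parameters can also be supplied through Open MPI's OMPI_MCA_* environment variables instead of the mpirun command line. Since mpi4py by default initialises MPI when mpi4py.MPI is imported, the variable just has to be set before that import, e.g.:

import os

# Open MPI picks up MCA parameters from OMPI_MCA_<parameter> environment
# variables at MPI_Init time, so this must happen before importing mpi4py.MPI.
# Equivalent to: mpirun --mca btl_vader_single_copy_mechanism none ...
os.environ["OMPI_MCA_btl_vader_single_copy_mechanism"] = "none"

import mpi4py.MPI as MPI  # MPI_Init runs here, with CMA disabled

comm = MPI.COMM_WORLD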

Read here about the shared-memory BTLs and the zero(single?)-copy mechanisms in Open MPI.

Hristo Iliev