
I have a C++ function computing a large tensor which I would like to return to Python as a NumPy array via pybind11.

From the pybind11 documentation, it seems like using an STL unique_ptr is desirable. In the following example, the commented-out version works, whereas the given one compiles but fails at runtime ("Unable to convert function return value to a Python type!").

Why is the smart-pointer version failing? What is the canonical way to create and return a NumPy array?

PS: Due to program structure and size of the array, it is desirable to not copy memory but create the array from a given pointer. Memory ownership should be taken by Python.

typedef py::array_t<double, py::array::c_style | py::array::forcecast> py_cdarray_t;

// py_cdarray_t _test()
std::unique_ptr<py_cdarray_t> _test()
{
    double * memory = new double[3];
    memory[0] = 11; memory[1] = 12; memory[2] = 13;
    py::buffer_info bufinfo (
        memory,                                   // pointer to memory buffer
        sizeof(double),                           // size of underlying scalar type
        py::format_descriptor<double>::format(),  // python struct-style format descriptor
        1,                                        // number of dimensions
        { 3 },                                    // buffer dimensions
        { sizeof(double) }                        // strides (in bytes) for each index
    );

    //return py_cdarray_t(bufinfo);
    return std::unique_ptr<py_cdarray_t>( new py_cdarray_t(bufinfo) );
}
mrupp

3 Answers


A few comments (then a working implementation).

  • pybind11's C++ object wrappers around Python types (like pybind11::object, pybind11::list, and, in this case, pybind11::array_t<T>) are really just wrappers around an underlying Python object pointer. In this respect they are already taking on the role of a shared pointer wrapper, so there's no point in wrapping that in a unique_ptr: returning the py::array_t<T> object directly is already essentially just returning a glorified pointer.
  • pybind11::array_t can be constructed directly from a data pointer, so you can skip the py::buffer_info intermediate step and just give the shape and strides directly to the pybind11::array_t constructor. A numpy array constructed this way won't own its own data, it'll just reference it (that is, the numpy owndata flag will be set to false).
  • Memory ownership can be tied to the life of a Python object, but you're still on the hook for doing the deallocation properly. Pybind11 provides a py::capsule class to help you do exactly this. What you want to do is make the numpy array depend on this capsule as its base object by specifying it as the base argument to array_t. That will make the numpy array reference it, keeping it alive as long as the array itself is alive, and invoke the cleanup function when it is no longer referenced.
  • The c_style flag in the older (pre-2.2) releases only had an effect on new arrays, i.e. when not passing a value pointer. That was fixed in the 2.2 release to also affect the automatic strides if you specify only shapes but not strides. It has no effect at all if you specify the strides directly yourself (as I do in the example below).
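
To see from the Python side what the base/owndata relationship in the bullets above means, here is a plain-numpy sketch (no pybind11 involved): a view references its parent through .base, keeps it alive, and has owndata set to false, which is the same mechanism the capsule trick relies on.

```python
import numpy as np

parent = np.arange(3, dtype=np.float64)  # owns its buffer
view = parent[:]                         # new array object, same memory

assert parent.flags.owndata              # parent allocated the data
assert not view.flags.owndata            # the view only references it
assert view.base is parent               # this reference keeps parent alive

view[0] = 99.0                           # writes through to the parent
assert parent[0] == 99.0
```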

So, putting the pieces together, this code is a complete pybind11 module that demonstrates how you can accomplish what you're looking for (and includes some C++ output to demonstrate that it is indeed working correctly):

#include <iostream>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

PYBIND11_PLUGIN(numpywrap) {
    py::module m("numpywrap");
    m.def("f", []() {
        // Allocate and initialize some data; make this big so
        // we can see the impact on the process memory use:
        constexpr size_t size = 100*1000*1000;
        double *foo = new double[size];
        for (size_t i = 0; i < size; i++) {
            foo[i] = (double) i;
        }

        // Create a Python object that will free the allocated
        // memory when destroyed:
        py::capsule free_when_done(foo, [](void *f) {
            double *foo = reinterpret_cast<double *>(f);
            std::cerr << "Element [0] = " << foo[0] << "\n";
            std::cerr << "freeing memory @ " << f << "\n";
            delete[] foo;
        });

        return py::array_t<double>(
            {100, 1000, 1000}, // shape
            {1000*1000*8, 1000*8, 8}, // C-style contiguous strides for double
            foo, // the data pointer
            free_when_done); // numpy array references this parent
    });
    return m.ptr();
}

Compiling that and invoking it from Python shows it working:

>>> import numpywrap
>>> z = numpywrap.f()
>>> # the python process is now taking up a bit more than 800MB memory
>>> z[1,1,1]
1001001.0
>>> z[0,0,100]
100.0
>>> z[99,999,999]
99999999.0
>>> z[0,0,0] = 3.141592
>>> del z
Element [0] = 3.14159
freeing memory @ 0x7fd769f12010
>>> # python process memory size has dropped back down
Jason Rhinelander
    For anybody who comes across this now, creating and assigning the capsule object is no longer necessary. – Nimitz14 Jan 24 '20 at 19:03
  • @Nimitz14 What do you mean precisely by "is no longer necessary"? It seems by default the `array_t` ctor now copies the data unless you specify a base. Getting non-copy semantics still seems to require a capsule. – cfh Feb 26 '20 at 11:45
  • Is it unnecessary to define a move return policy via `return_value_policy::move` in this case? Wondering if the returned `array_t` will be copied or not. – glinka May 19 '21 at 14:06
  • providing a base argument `py::capsule` to `py::array_t` constructor as `py::array_t(fsv_size, fsv, capsule)` fails with `Windows fatal exception: code 0xc0000374` (Heap Corruption) on Windows while works without problems on unix and macos, more details [here](https://github.com/quantumlib/qsim/issues/346). Any suggestions why that's the case? – Laurynas Tamulevičius May 24 '21 at 13:47
  • @Nimitz14 I know this was a long time ago, but any idea what the current recommended approach for returning NumPy array would be? – willem Jan 06 '22 at 09:59
  • How I do it: don't allocate anything on the C++ side, just create a py::array_t with shape and strides and no other arguments. It will then allocate and handle its own memory. You can then use mutable_unchecked() to get a view of the array that lets you index without bound checks. Maybe I'll write up a quick answer – AwkwardWhale Aug 30 '23 at 08:44

I recommend using the ndarray library. A foundational principle of it is that the underlying data is never copied unless explicitly requested (otherwise you quickly end up with huge inefficiencies). Below is an example of it in use, but there are other features I haven't shown, including conversion to Eigen arrays (ndarray::asEigen(array)), which makes it pretty powerful.

Header:

#ifndef MYTENSORCODE_H
#define MYTENSORCODE_H

#include "ndarray_fwd.h"

namespace myTensorNamespace {

ndarray::Array<double, 2, 1> myTensorFunction(int param1, double param2);

}  // namespace myTensorNamespace

#endif  // include guard

Lib:

#include "ndarray.h"
#include "myTensorCode.h"

namespace myTensorNamespace {

ndarray::Array<double, 2, 1> myTensorFunction(int param1, double param2) {
    std::size_t const size = calculateSize();
    ndarray::Array<double, 2, 1> array = ndarray::allocate(size, size);
    array.deep() = 0;  // initialise
    for (std::size_t ii = 0; ii < size; ++ii) {
        array[ii][ndarray::view(ii, ii + 1)] = 1.0;
    }
    return array;
}

}  // namespace myTensorNamespace

Wrapper:

#include "pybind11/pybind11.h"
#include "ndarray.h"
#include "ndarray/pybind11.h"

#include "myTensorCode.h"

namespace py = pybind11;
using namespace pybind11::literals;

namespace myTensorNamespace {
namespace {

PYBIND11_MODULE(myTensorModule, mod) {
    mod.def("myTensorFunction", &myTensorFunction, "param1"_a, "param2"_a);
}

}  // anonymous namespace
}  // namespace myTensorNamespace
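
As an aside, the loop in myTensorFunction writes 1.0 into element (ii, ii) of a zero-initialised size-by-size array, so the function returns the identity matrix. A plain-numpy sketch of the same result (with a made-up size standing in for calculateSize()):

```python
import numpy as np

size = 4  # stand-in for the value returned by calculateSize()
array = np.zeros((size, size))
for ii in range(size):
    array[ii, ii:ii + 1] = 1.0  # mirrors array[ii][ndarray::view(ii, ii + 1)]

assert np.array_equal(array, np.eye(size))
```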
Paul Price

Since the accepted answer seems to be out of date, I'll give a quick example of how to implement it with the current API. The key thing to keep in mind is that if we are constructing an array "from scratch", we don't need to manually allocate any data, we just need to give the py::array_t constructor our desired shape and strides, and it will manage its own memory.

#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

PYBIND11_MODULE(numpywrap, m) {
    m.def("f", []() {
        constexpr size_t elsize = sizeof(double);
        size_t shape[3]{100, 1000, 1000};
        size_t strides[3]{1000 * 1000 * elsize, 1000 * elsize, elsize};
        auto a = py::array_t<double>(shape, strides);
        auto view = a.mutable_unchecked<3>();
        
        for (py::ssize_t i = 0; i < a.shape(0); i++) {
            for (py::ssize_t j = 0; j < a.shape(1); j++) {
                for (py::ssize_t k = 0; k < a.shape(2); k++) {
                    view(i, j, k) = whatever_data_should_go_here(i, j, k);
                }
            }
        }
        return a;
    });
}
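
To sanity-check the stride arithmetic used above (for a C-contiguous array, each stride is the product of the trailing dimensions times the element size), here is a small numpy sketch with a made-up shape; numpy computes the same byte strides for its own C-ordered allocation:

```python
import numpy as np

shape = (4, 3, 2)
elsize = np.dtype(np.float64).itemsize  # 8 bytes per double
# C-style contiguous strides, computed the same way as in the C++ snippet:
strides = (shape[1] * shape[2] * elsize, shape[2] * elsize, elsize)

a = np.empty(shape, dtype=np.float64)   # numpy's own C-order allocation
assert a.strides == strides             # → (48, 16, 8)
```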
AwkwardWhale