Following this post, I want to know how I can pass a list of strings from Python to C (i.e., using C headers and syntax, not C++) through pybind11. I'm fully aware that pybind11 is a C++ library and that the code must be compiled by a C++ compiler anyway. However, I find the C++ implementations difficult to understand, for example here and here.

Here I tried to pass a Python list of strings as pointers, represented as integers, and then receive them as long* in C, but it didn't work.

The C/C++ code should be something like:

// example.cpp
#include <stdio.h>
#include <stdlib.h>

#include <pybind11/pybind11.h>

int run(/*<pure C or pybind11 datatypes> args*/){

    // if pybind11 data types are used convert them to pure C :
    // int argc = length of args
    // char* argv[] =  array of pointers to the strings in args, possible malloc

    for (int i = 0; i < argc; ++i) {
        printf("%s\n", argv[i]);
    } 

    // possible free

    return 0;
}

PYBIND11_MODULE(example, m) {

    m.def("run", &run, "runs the example");
}

A simple CMakeLists.txt example is also provided here, and the Python code can be something like this:

#example.py
import example

print(example.run(["Lorem", "ipsum", "dolor", "sit", "amet"]))

To avoid misunderstandings like this, please consider these points:

  • This is not an XY question, as the presumed Y problem has already been solved in the correct/canonical way using C++ headers/standard libraries and syntax (links above). The purpose of this question is pure curiosity. Solving the problem in a syntax I'm familiar with will help me to understand the underlying nature of pybind11 data types and functionality. Please do not try to find the Y problem and solve it.
  • I'm completely aware that pybind11 is a C++ library and the code must be compiled with a C++ compiler anyway.
  • I would appreciate it if you would consult me in the comments about required edits to my question, rather than doing it yourself. I know you want to help, but I have tried to frame my question as nicely as possible to avoid confusion.
  • I would appreciate it if you would avoid changing the non-commented parts of my C/C++ and Python code as much as possible.
  • I'm aware that using the term "C/C++" is wrong. I use the term to refer to C++ code written in C syntax and using C headers. I'm sorry that I don't know a better name for it.
  • As the commented parts of the example.cpp file indicate, it is OK to use pybind11 datatypes and then convert them to C. But I suspect a pure C solution might also be possible. For example, see this attempt.
tstenner
Foad S. Farimani
  • `the code must be compiled with a C++` - so write a short wrapper that takes `std::vector<std::string>` and converts it to `char **` – KamilCuk Feb 04 '20 at 23:55
  • @KamilCuk that's actually what most of the existing solutions have done. Please consider that I want to use only C syntax, headers and standard library. – Foad S. Farimani Feb 04 '20 at 23:58
  • @KamilCuk It is ok to use pybind11 datatypes like `py::list` and then convert them to pure C. – Foad S. Farimani Feb 05 '20 at 00:03
  • Conversion of C++ data types to structures that C code can understand is not possible in a portable way. You'd have to understand how your specific C++ compiler and linker organize objects in memory, and even then, it's not trivial. – jwdonahue Feb 05 '20 at 00:34
  • I would also add that these formats change with different compiler and linker settings, as well as across versions. – jwdonahue Feb 05 '20 at 00:36

1 Answer

Below I've reformatted the previous example code where I used C++ constructs, to only use C and pybind11 ones.

#include <pybind11/pybind11.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free

#if PY_VERSION_HEX < 0x03000000
#define MyPyText_AsString PyString_AsString
#else
#define MyPyText_AsString PyUnicode_AsUTF8
#endif

namespace py = pybind11;

int run(py::object pyargv11) {
    int argc = 0;
    char** argv = NULL;

    // convert the input sequence to C-style argc/argv
    PyObject* pyargv = pyargv11.ptr();
    if (PySequence_Check(pyargv)) {
        Py_ssize_t sz = PySequence_Size(pyargv);
        argc = (int)sz;

        argv = (char**)malloc(sz * sizeof(char*));
        for (Py_ssize_t i = 0; i < sz; ++i) {
            PyObject* item = PySequence_GetItem(pyargv, i);   // new reference
            argv[i] = (char*)MyPyText_AsString(item);         // pointer is not a copy
            Py_DECREF(item);
            if (!argv[i] || PyErr_Occurred()) {
                free(argv);
                argv = nullptr;
                break;
            }
        }
    }

    if (!argv) {
        //fprintf(stderr,  "argument is not a sequence of strings\n");
        //return;

        // report the failure to the caller as a Python exception
        if (!PyErr_Occurred())
            PyErr_SetString(PyExc_TypeError, "could not convert input to argv");
        throw py::error_already_set();
    }

    // use argc/argv
    for (int i = 0; i < argc; ++i)
        fprintf(stderr, "%s\n", argv[i]);

    free(argv);

    return 0;
}

PYBIND11_MODULE(example, m) {
    m.def("run", &run, "runs the example");
}

Below I go through it piece by piece, with heavy commentary, to explain what I'm doing and why.

In Python2, string objects are char* based; in Python3, they are Unicode based. Hence the following macro, MyPyText_AsString, which changes behavior based on the Python version, since we need to get to a C-style "char*".

#if PY_VERSION_HEX < 0x03000000
#define MyPyText_AsString PyString_AsString
#else
#define MyPyText_AsString PyUnicode_AsUTF8
#endif

The pyargv11 py::object is a thin wrapper around a Python C-API object handle (a PyObject*); since the following code makes use of the Python C-API, it's easier to deal with the underlying PyObject* directly.

int run(py::object pyargv11) {
    int argc = 0;            // the length that we'll pass
    char** argv = NULL;      // array of pointers to the strings

    // convert input list to C/C++ argc/argv :

    PyObject* pyargv = pyargv11.ptr();

The code only accepts containers that implement the sequence protocol and can thus be indexed and looped over. This covers the two most important ones, PyTuple and PyList, at the same time (albeit a tad slower than checking for those types directly, but this keeps the code more compact). To be fully generic, the code should also check for the iterator protocol (e.g. for generators) and should probably reject str objects, but both are unlikely inputs.

    if (PySequence_Check(pyargv)) {
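
To get a feel for what this check accepts, the same C-API function can be called from Python itself. This is only a sketch and assumes CPython, where ctypes.pythonapi exposes the interpreter's own C-API; the helper name is_sequence and the sample inputs are made up for illustration:

```python
import ctypes

# ctypes.pythonapi gives access to the C-API of the running CPython interpreter.
PySequence_Check = ctypes.pythonapi.PySequence_Check
PySequence_Check.argtypes = [ctypes.py_object]
PySequence_Check.restype = ctypes.c_int


def is_sequence(obj):
    """Hypothetical helper: True if the C-level PySequence_Check accepts obj."""
    return bool(PySequence_Check(ctypes.py_object(obj)))


candidates = {
    "list": ["Lorem", "ipsum"],
    "tuple": ("Lorem", "ipsum"),
    "str": "Lorem",
    "generator": (s for s in ["Lorem", "ipsum"]),
    "dict": {"Lorem": 1},
}
results = {name: is_sequence(obj) for name, obj in candidates.items()}
print(results)
```

Lists and tuples pass, generators and dicts do not. A plain str passes as well, so it would be walked character by character unless it is rejected explicitly.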

Okay, we have a sequence; now get its size. (This step is the reason why generators and other iterators would need the Python iterator protocol instead: their size is typically not known up front, although you can request a hint.)

        Py_ssize_t sz = PySequence_Size(pyargv);
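
The difference between a known size and a mere hint can be seen from the Python side. A small sketch, with illustrative variable names; operator.length_hint is the Python-level counterpart of the C-API's PyObject_LengthHint:

```python
import operator

words = ["Lorem", "ipsum", "dolor"]
gen = (w for w in words)

# A list is a sequence: its size is known up front.
list_size = len(words)

# A generator has no size; len() raises TypeError.
try:
    len(gen)
    gen_has_len = True
except TypeError:
    gen_has_len = False

# length_hint falls back to the given default when the object offers no estimate.
gen_hint = operator.length_hint(gen, 0)
# A list iterator does offer an estimate through __length_hint__.
iter_hint = operator.length_hint(iter(words), 0)

# One simple way out on the Python side: materialise the iterable first.
materialised = list(gen)
print(list_size, gen_has_len, gen_hint, iter_hint, materialised)
```

Since a hint is only an estimate, converting the input to a list before calling into C is probably the least surprising way to support generators.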

One part, the size, is done; store it in the variable that can be passed on to other functions.

        argc = (int)sz;

Now allocate the array of pointers to char* (technically const char*, but that does not matter here as we'll cast it away).

        argv = (char**)malloc(sz * sizeof(char*));

Next, loop over the sequence to retrieve the individual elements.

        for (Py_ssize_t i = 0; i < sz; ++i) {

This gets a single element from the sequence. The GetItem call is equivalent to Python's "[i]", i.e. a __getitem__ call.

            PyObject* item = PySequence_GetItem(pyargv, i);

Next, convert the item to a C-style "char*" with the MyPyText_AsString macro defined at the top, which picks the right conversion function for Python2 or Python3.

The cast from const char* to char* here is in principle safe, but the contents of argv[i] must NOT be modified by other functions. The same is true for the argv argument of a main(), so I'm assuming that to be the case.

Note that the C string is NOT copied. The reason is that in Py2, you simply get access to the underlying data, and in Py3, the converted string is kept as a data member of the Unicode object and Python will do the memory management. In both cases, we are guaranteed that their lifetimes will be at least as long as the lifetime of the input Python object (pyargv11), so at least for the duration of this function call. This assumes the container itself holds a reference to each item, as lists and tuples do. If other functions decide to keep the pointers, copies would be needed.

            argv[i] = (char*)MyPyText_AsString(item);

The result of PySequence_GetItem was a new reference, so now that we're done with it, drop it:

            Py_DECREF(item);

It is possible that the input sequence did not contain only Python str objects. In that case, the conversion will fail and we need to check for that, or the code that consumes argv may segfault.

            if (!argv[i] || PyErr_Occurred()) {
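
What a failed conversion looks like can be tried out from Python. Again, this is a sketch that assumes CPython 3 and goes through ctypes.pythonapi, which raises the pending Python error instead of handing back NULL. The helper as_utf8 is hypothetical:

```python
import ctypes

# Same function the MyPyText_AsString macro expands to on Python 3.
PyUnicode_AsUTF8 = ctypes.pythonapi.PyUnicode_AsUTF8
PyUnicode_AsUTF8.argtypes = [ctypes.py_object]
PyUnicode_AsUTF8.restype = ctypes.c_char_p


def as_utf8(obj):
    """Hypothetical helper: UTF-8 bytes of a str, via the C-API."""
    return PyUnicode_AsUTF8(ctypes.py_object(obj))


ok = as_utf8("Lorem")           # plain ASCII text
encoded = as_utf8("\u00e9")     # non-ASCII text comes back UTF-8 encoded

# A non-str item makes the C function fail and set a TypeError.
try:
    as_utf8(42)
    failed = False
except TypeError:
    failed = True

print(ok, encoded, failed)
```

In the C code there is no such automatic raise, which is why the NULL result and PyErr_Occurred() have to be checked by hand.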

Clean up the memory previously allocated.

                free(argv);

Set argv to NULL for success checking later on:

                argv = nullptr;

Give up on the loop:

                break;

If the given object was not a sequence, or if one of the elements of the sequence was not a string, then we don't have an argv and so we bail:

    if (!argv) {

The following is a bit lazy, but probably better to understand if all you want to look at is C code.

        fprintf(stderr,  "argument is not a sequence of strings\n");
        return 1;    // non-zero, since run() returns 0 on success

What you should really do, is check whether an error was already set (e.g. b/c of a conversion problem) and set one if not. Then notify pybind11 of it. This will give you a clean Python exception on the caller's end. This goes like so:

        if (!PyErr_Occurred())
            PyErr_SetString(PyExc_TypeError, "could not convert input to argv");
        throw py::error_already_set();       // by pybind11 convention.

Alright, if we get here, then we have an argc and argv, so now we can use them:

    for (int i = 0; i < argc; ++i)
        fprintf(stderr, "%s\n", argv[i]);

Finally, clean up the allocated memory.

    free(argv);

Notes:

  • I would still advocate for the use of at least std::unique_ptr as that makes life so much easier in case there are C++ exceptions thrown (from custom converters of any input object).
  • I was originally expecting to be able to replace all of the code with the one-liner std::vector<char*> pv{pyargv.cast<std::vector<char*>>()}; after #include <pybind11/stl.h>, but I found that that does not work (even as it does compile). Neither did using std::vector<std::string> (also compiles, but also fails at run-time).

Just ask if anything is still unclear.

EDIT: If you truly only want to have a PyListObject, just call PyList_Check(pyargv11.ptr()) and if true, cast the result: PyListObject* pylist = (PyListObject*)pyargv11.ptr(). Now, if you want to work with py::list, you can also use the following code:

#include <pybind11/pybind11.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free

#if PY_VERSION_HEX < 0x03000000
#define MyPyText_AsString PyString_AsString
#else
#define MyPyText_AsString PyUnicode_AsUTF8
#endif

namespace py = pybind11;

int run(py::list inlist) {
    int argc = (int)inlist.size();
    char** argv = (char**)malloc(argc * sizeof(char*));

    for (int i = 0; i < argc; ++i)
        argv[i] = (char*)MyPyText_AsString(inlist[i].ptr());

    for (int i = 0; i < argc; ++i)
        fprintf(stderr, "%s\n", argv[i]);

    free(argv);

    return 0;
}

PYBIND11_MODULE(example, m) {
    m.def("run", &run, "runs the example");
}

This code is shorter only b/c it has less functionality: it only accepts lists, and its error handling is clunkier (e.g. it will leak if passed a list of integers due to pybind11 throwing an exception; to fix that, use unique_ptr as in the very first example code so that argv is freed on exception).
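
Until the C side is hardened, one option is to validate the input on the Python side before calling into the extension. A minimal sketch; checked_args is a hypothetical helper, not part of pybind11 or of the module above:

```python
def checked_args(args):
    """Return args unchanged if it is a list of str, else raise TypeError."""
    if not isinstance(args, list):
        raise TypeError("expected a list, got %s" % type(args).__name__)
    for i, arg in enumerate(args):
        if not isinstance(arg, str):
            raise TypeError("item %d is %s, not str" % (i, type(arg).__name__))
    return args


# Intended use (requires the compiled extension module):
#   import example
#   example.run(checked_args(["Lorem", "ipsum", "dolor", "sit", "amet"]))
```

This keeps bad input from ever reaching the malloc/free pair, at the cost of one extra pass over the list.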

Wim Lavrijsen
  • Thanks a lot, man. You are awesome. Mind if I reformat your post so it is more readable for me? Unfortunately, my edits will be applied without your peer review, but you can revert or re-edit any part you don't like. – Foad S. Farimani Feb 05 '20 at 07:02
  • I reformatted the post for better readability. I hope you don't mind it. – Foad S. Farimani Feb 05 '20 at 09:36
  • I have lots of questions, but if you don't mind I want to start with this: 1. Why doesn't my attempt [here](https://gist.github.com/Foadsf/ac170ada8aeaec4a73842a5eba6e1179) work? Why can't I pass a Python list of strings to C by pointers and read it over there as `long*`? – Foad S. Farimani Feb 05 '20 at 09:39
  • Also, I don't think that you have defined `item`; I think you forgot to use `PySequence_GetItem`. – Foad S. Farimani Feb 05 '20 at 11:39
  • Definition of 'item' was there in the original code. Both for the use with `PySequence_GetItem` and for the needed `Py_DECREF` later (which is not needed in `PyList_GetItem` that you used, as that returns a borrowed reference). The code you linked works "as is", so I guess I'm looking at a version newer than your comment. I really recommend retaining the error handling, though. Using a `long*` won't work b/c you first need to get out of pybind11, which doesn't appear to have a good converter for you. – Wim Lavrijsen Feb 05 '20 at 15:20
  • Yes, I did update the Gist based on your code, and the error handling stuff I removed just to help understand the main functionality. Next step I want to replace the `py::object` with Cpython `PyListObject` if possible. – Foad S. Farimani Feb 05 '20 at 15:58
  • Would you be so kind as to re-edit your post and add the missing part? Also, what I have in the GitHub Gist has a bug/issue: it prints one extra slice of memory. – Foad S. Farimani Feb 05 '20 at 22:25
  • Re-edited and made some notes about using Python `list` and added a py::list example code. – Wim Lavrijsen Feb 06 '20 at 03:49
  • Thanks a lot. You have already helped me a lot. Mind if I keep bothering you with more questions? – Foad S. Farimani Feb 06 '20 at 09:11
  • Sure; or open a new one if this wall of text gets too long. If it has "pybind11" in there, I'll come across it soon enough. – Wim Lavrijsen Feb 06 '20 at 15:28