Is using the C code to create Python extensions UB in C++ due to lifetime?

Question

There are many tutorials out there on how to create C extensions of Python which introduce a new type. One example: https://docs.python.org/3.5/extending/newtypes.html

This usually boils down to creating a struct like:

struct Example
{
    PyObject_HEAD
    // Extra members
};

And then registering it in a module by implicitly or explicitly defining function pointer mappings. The lifetime-related ones are tp_alloc, tp_new, tp_init, tp_free, tp_dealloc.

As I understand it, PyObject_HEAD expands to PyObject ob_base; which makes Example* and PyObject* convertible (I guess there is some special wording for the case where it is the first member), so all code accepts a PyObject* and can work with it as if struct Example: public PyObject{}; had been used. All good so far.
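To make the conversion concrete, here is a minimal sketch of the first-member layout. FakePyObject, as_object and as_example are hypothetical names made up for illustration; the real PyObject comes from Python.h and its exact members are not assumed here. The sketch only shows the layout conditions that the cast relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for PyObject (the real one comes from <Python.h>).
struct FakePyObject {
    std::ptrdiff_t ob_refcnt;
    void*          ob_type;
};

// Mirrors `struct Example { PyObject_HEAD ... }` from the question.
struct Example {
    FakePyObject ob_base;  // what PyObject_HEAD is assumed to expand to
    int          value;    // an "extra member"
};

// The first-member conversion only applies to standard-layout classes.
static_assert(std::is_standard_layout<Example>::value,
              "Example must be standard-layout");
static_assert(offsetof(Example, ob_base) == 0,
              "ob_base must be the first member");

// Example* -> FakePyObject*: yields the address of the first member.
inline FakePyObject* as_object(Example* e) {
    return reinterpret_cast<FakePyObject*>(e);
}

// FakePyObject* -> Example*: only meaningful if `o` really is the
// ob_base member of a live Example object.
inline Example* as_example(FakePyObject* o) {
    return reinterpret_cast<Example*>(o);
}
```

Adding a virtual function or a base class with data members to Example would make the first static_assert fire, which is a cheap way to notice that the cast no longer has this guarantee.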

But now the problem is the lifetime of Example. After some digging, it seems that the following happens:

tp_new is called with the "type_info" (function pointer mapping) of the object to create
this calls tp_alloc which defaults to (basically) malloc
then tp_init is called with the memory pointer from tp_new which e.g. populates the ref counter
on destruction tp_dealloc is called
this calls tp_free (basically free)

So what is obviously missing is a call to the constructor and destructor, which is fine in practice if the struct is a POD.

However, recent C++ standards have made it clear that simply malloc-ing memory is not enough to create an object; see e.g. std::launder and related discussions.

Hence is compiling such a C extension as C++ already UB? If not, I guess there is a special rule for PODs, so those would be safe, wouldn't they? Are there any references for clarification?
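As a rough illustration of the "special rule for PODs" idea, the following sketch checks at compile time whether skipping the constructor and destructor could have any observable effect. FakePyObject, the two example structs, and the helper ctor_and_dtor_are_noops are hypothetical names. The check says nothing about whether a given standard version formally starts the object's lifetime; it only separates the trivial case from the one that clearly needs construction.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for PyObject (the real one comes from <Python.h>).
struct FakePyObject {
    long  ob_refcnt;
    void* ob_type;
};

// Only trivial members: constructor and destructor do nothing.
struct TrivialExample {
    FakePyObject ob_base;
    int          value;
};

// std::string has a real constructor and destructor, so this type
// cannot simply be malloc-ed and used.
struct NonTrivialExample {
    FakePyObject ob_base;
    std::string  name;
};

// True if default construction and destruction are both no-ops for T.
template <class T>
constexpr bool ctor_and_dtor_are_noops() {
    return std::is_trivially_default_constructible<T>::value &&
           std::is_trivially_destructible<T>::value;
}

// Guard an extension type that relies on the plain malloc/free path.
static_assert(ctor_and_dtor_are_noops<TrivialExample>(),
              "TrivialExample must stay trivial");
```

A static_assert like the last one placed next to the type definition would at least turn an accidental non-trivial member into a compile error instead of a silent lifetime bug.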

Is there any documentation on a safe way to create non-POD types in a performant manner? That is, without adding a pointer to the Example POD object above that points to a non-POD object which is then created via new or similar.

From the description and the answer to "Should {tp_alloc, tp_dealloc} and {tp_new, tp_free} be considered as pairs?" I would conclude that tp_new could either do a new and return the result, or call tp_alloc, do a placement new on the returned memory, and return that. This sounds to me like it satisfies the "only as much further initialization as is absolutely necessary" requirement. tp_dealloc would then call the destructor and forward to tp_free. That sounds good, but could it be a problem if the alignment of the memory returned by tp_alloc is wrong?
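The placement-new idea above could look roughly like the following sketch. Everything here is a stand-in: FakePyObject, fake_alloc, fake_free, example_new and example_dealloc are hypothetical names that only mimic the roles of PyObject, tp_alloc, tp_free, tp_new and tp_dealloc, and the real CPython signatures differ. It assumes the allocator has already filled in the object header, so the header is saved and restored around the placement new, and it checks the alignment concern with an assert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <new>
#include <string>

// Hypothetical stand-in for PyObject (the real one comes from <Python.h>).
struct FakePyObject {
    std::ptrdiff_t ob_refcnt;
    void*          ob_type;
};

struct Example {
    FakePyObject ob_base;
    std::string  name;  // non-trivial member that needs a constructor
};

// Plays the role of tp_alloc: zeroed memory with the header filled in.
inline void* fake_alloc(std::size_t n) {
    void* mem = std::calloc(1, n);
    if (mem) {
        FakePyObject header{1, nullptr};
        std::memcpy(mem, &header, sizeof header);
    }
    return mem;
}

// Plays the role of tp_free.
inline void fake_free(void* p) { std::free(p); }

// Plays the role of tp_new: allocate, then construct in place.
inline FakePyObject* example_new() {
    void* mem = fake_alloc(sizeof(Example));
    if (!mem) return nullptr;
    // The alignment worry from the text: verify before constructing.
    assert(reinterpret_cast<std::uintptr_t>(mem) % alignof(Example) == 0);
    FakePyObject header;
    std::memcpy(&header, mem, sizeof header);  // keep what the allocator wrote
    Example* self = new (mem) Example;         // runs std::string's constructor
    self->ob_base = header;                    // restore the header afterwards
    return &self->ob_base;
}

// Plays the role of tp_dealloc: destroy, then release the memory.
inline void example_dealloc(FakePyObject* o) {
    Example* self = reinterpret_cast<Example*>(o);
    self->~Example();  // runs std::string's destructor
    fake_free(self);
}
```

An alternative with the same effect would be to construct only the non-trivial members in place (for example the std::string) and leave the header untouched. Either way, the sketch is only correct if the new-like and dealloc-like functions each run exactly once per object, which is the open question in the text.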

Are there guarantees that tp_new and tp_dealloc are called exactly once?

Some pseudocode for non-Python programmers, following the description above:

PyObject* tp_alloc(size_t n){ return (PyObject*)malloc(n); }
PyObject* tp_new(PyTypeObject* typeInfo){ return typeInfo->tp_alloc(typeInfo->object_size); }
PyObject* tp_init(PyTypeObject* typeInfo, PyObject* o){ o->typeInfo = typeInfo; o->refCnt = 1; return o; }
void tp_dealloc(PyObject* o){ o->typeInfo->tp_free(o); }
void tp_free(void* m){ free(m); }

//User code
struct Example
{
    PyObject obj;
    // Extra members
};
void register_type(){  // "register" itself is a reserved word in C and C++
  PyTypeObject info = {.tp_alloc = tp_alloc, .tp_new = tp_new, .object_size = sizeof(Example), ...};
  PythonRegister("Example", info);
}

Note that this is simplified. Python will then use the info object whenever an object of the type named "Example" is created or used. You can override all of these functions and convert between Example* and PyObject* even though there is no inheritance, because the two are "pointer-interconvertible" by:

one is a standard-layout class object and the other is the first non-static data member of that object https://en.cppreference.com/w/cpp/language/static_cast

My idea is to override the default tp_new with something like:

PyObject* Example_new(PyTypeObject* typeInfo){ return reinterpret_cast<PyObject*>(new(typeInfo->tp_alloc(typeInfo->object_size)) Example); }

What I want to know is whether this is required, and whether it is valid at all.

Do note that POD no longer exists in C++. It was removed in C++11 and replaced with standard layout class. You get some guarantees with them, but C++'s strict aliasing rules are still going to bite you. What you can do is compile the C code as C and link it into your C++ project. — NathanOliver, May 26 '20 at 15:49
Does it? I've seen `is_pod` added in C++11 which is "standard layout + trivial". But the exact name isn't important, it seems you understood what I meant. Wasn't there something about "pointer-interconvertible" which covers that case? I'm also not sure that this 2 step compilation is really possible. What happens with setuptools is an invocation of `gcc ... myfile.cpp -std=c++11` which does invoke the C compiler but (I assume) compiles the code as C++. Changing languages in 1 extensions doesn't seem to be possible especially as I need to access the C++ class in C too for the binding — Flamefire, May 26 '20 at 15:58
I think, strict aliasing rule is different from "trivialness" of types, and it looks like you are munching those too a bit. Also, probably adding more code would help people not familiar with Python (like myself) to understand the behavior. — SergeyA, May 26 '20 at 16:00
Worth mentioning that in C++20 this becomes a non-issue. Assuming the types in question are "well-behaved". — StoryTeller - Unslander Monica, May 26 '20 at 16:02
@Flamefire Ah it looks like I had my standards wrong. C++11 was the start of the new requirements, and it was deprecated in C++20. Without seeing concrete code, all I can assume is the C way is not the C++ way and is UB instead until (possibly depending on the code) C++20. — NathanOliver, May 26 '20 at 16:03
Related to: [Is circumventing a class' constructor legal or does it result in undefined behaviour?](https://stackoverflow.com/questions/37644977/is-circumventing-a-class-constructor-legal-or-does-it-result-in-undefined-behav/61999151). Note that [P0593R6](https://wg21.link/P0593R6) changes how this is viewed, into being OK, for: _implicitly-created objects whose address is the address of the start of the region of storage, and produce a pointer value that points to that object, if that value would result in the program having defined behavior_. — Amir Kirsh, May 26 '20 at 17:40
I added some code transcribing my words to give you an idea what happens. Not 100% sure though as I'm new to this too. @AmirKirsh If I read that correctly than this means the behavior for trivial types has been retroactively been defined for C++ (all). So as long as `Example` stays trivial then I'll have no UB. IMO this matches existing behavior because I mean what is the compiler supposed to do if it doesn't even know about the code creating the object but merely being handed a pointer to it? How would it know if the ctor was called or not when it has no effect (due to trivial requirement)? — Flamefire, May 27 '20 at 07:12
The key is in the last words of the sentence, in the spec, describing the validity of the pointer to an object as: _if that value would result in the program having defined behavior_. A bit of tautology for defining the conditions for having a "defined behavior". It seems to me that your example falls in the category of defined behavior. But I believe some examples of cases where such an object would have _undefined behavior_ are necessary. — Amir Kirsh, May 27 '20 at 07:28

