6

I've taken over some code, and came across a weird reallocation of an array. This is a function from within an Array class (used by the JsonValue)

void reserve( uint32_t newCapacity ) {
    if ( newCapacity > length + additionalCapacity ) {
        newCapacity = std::min( newCapacity, length + std::numeric_limits<decltype( additionalCapacity )>::max() );
        JsonValue *newPtr = new JsonValue[newCapacity];

        if ( length > 0 ) {
            memcpy( newPtr, values, length * sizeof( JsonValue ) );
            memset( values, 0, length * sizeof( JsonValue ) );
        }

        delete[] values;

        values = newPtr;
        additionalCapacity = uint16_t( newCapacity - length );
    }
}

I get the point of this; it is just allocating a new array, and doing a copy of the memory contents from the old array into the new array, then zero-ing out the old array's contents. I also know this was done in order to prevent calling destructors, and moves.

The JsonValue is a class with functions, and some data which is stored in a union (string, array, number, etc.).

My concern is whether this is actually defined behaviour or not. I know it works, and has not had a problem since we began using it a few months ago; but if its undefined then it doesn't mean it is going to keep working.

EDIT: JsonValue looks something like this:

struct JsonValue {
// …
    ~JsonValue() {
        switch ( details.type ) {
        case Type::Array:
        case Type::Object:
            array.destroy();
            break;
        case Type::String:
            delete[] string.buffer;
            break;
        default: break;
        }
    }
private:
    struct Details {
        Key key = Key::Unknown;
        Type type = Type::Null; // (0)
    };

    union {
        Array array;
        String string;
        EmbedString embedString;
        Number number;
        Details details;
    };
};

Where Array is a wrapper around an array of JsonValues, String is a char*, EmbedString is char[14], Number is a union of int, unsigned int, and double, Details contains the type of value it holds. All values have 16-bits of unused data at the beginning, which is used for Details. Example:

struct EmbedString {
    uint16_t : 16;
           char buffer[14] = { 0 };
};
ChrisMM
  • 8,448
  • 13
  • 29
  • 48
  • Where does `values` come from? Is it an instance variable, or global? – SimonC Dec 03 '19 at 10:46
  • @ChrisMM This statement memset( values, 0, length * sizeof( JsonValue ) ) does not make sense.; – Vlad from Moscow Dec 03 '19 at 10:48
  • @SimonC, it is a class member variable. Sorry, the `reserve` function is part of a class. – ChrisMM Dec 03 '19 at 10:49
  • @VladfromMoscow clarify that last point, why does the `memset` call not make sense? – Mansoor Dec 03 '19 at 10:49
  • @Mansoor Because the memory pointed to by the pointer values is freed at once.:) – Vlad from Moscow Dec 03 '19 at 10:50
  • @VladfromMoscow, I think that is used to prevent the destructors of the JsonValue class which were stored in the `values` array from being called? – ChrisMM Dec 03 '19 at 10:51
  • @ChrisMM It is a bad idea. – Vlad from Moscow Dec 03 '19 at 10:52
  • @VladfromMoscow, I am not disagreeing at all on that. Just wondering if it is _valid_ – ChrisMM Dec 03 '19 at 10:52
  • It won't prevent the destructor(s) being called, they will be called with whatever `0` does to the class's members. This might be safe or it might not depends on the declaration of `JsonValue` Won't go as far as to say UB but is possible. – Richard Critten Dec 03 '19 at 10:53
  • @RichardCritten, thanks for the clarification. The destructor checks is the enum value it contains is 0, if it is, it doesn't do anything. – ChrisMM Dec 03 '19 at 10:54
  • What is `details.type` definition? What is the value for `details.type` if it's equal to 0? – KamilCuk Dec 03 '19 at 11:22
  • @KamilCuk, sorry, added the definition too. `Type::Null` is 0, so the `switch` statement in the destructor would not go to either case. – ChrisMM Dec 03 '19 at 11:24
  • 1
    A *possible* reason for explicitly setting the `values` memory to zero before (immediately) deleting the array is to prevent security leaks - i.e. some other program hacking into the freed memory, which *may* still contain (sensitive) data. – Adrian Mole Dec 03 '19 at 11:39
  • Stopping the the destructors from being called while not stopping the constructors from being called in `new JsonValue[]` is inconsistent. If constructor/destructor calls should be avoided, the memory should be allocated as a `char` array, and the pointer cast to the correct type. Then you can copy in the bytes from the old array, and delete the old array as a `char` array. Or just use `::operator new()`/`::operator delete()` directly in this case. – cmaster - reinstate monica Dec 03 '19 at 14:29
  • @cmaster Wouldn't omitting the `new` bring problems with lifetime? At what point would the lifetime of the copied objects begin? – n314159 Dec 03 '19 at 14:31
  • @n314159 Since you are treating the array as an array of PODs by moving it with `memcpy()`, there is no lifetime that needs to begin. There is an explicit exception in the strict aliasing rules within the standard that allows for allocating a `char` array, initializing it with bytes that form a valid representation of the type to which you cast the pointer, and the accessing the memory through that casted pointer. If `JsonValue` is an object that needs a lifetime to begin, you are not allowed to skip either the constructor or the destructor. – cmaster - reinstate monica Dec 03 '19 at 15:05
  • 1
    This looks like a home grown replacement of std::vector to me. I wonder what is the reason for not using std::vector in the first place. – n. m. could be an AI Dec 03 '19 at 15:07

1 Answers1

4

Whether this code has well-defined behavior basically depends on two things: 1) is JsonValue trivially-copyable and, 2) if so, are a bunch of all-zero Bytes a valid object representation for a JsonValue.

If JsonValue is trivially-copyable, then the memcpy from one array of JsonValues to another will indeed be equivalent to copying all the elements over [basic.types]/3. If all-zeroes is a valid object representation for a JsonValue, then the memset should be ok (I believe this actually falls into a bit of a grey-area with the current wording of the standard, but I believe at least the intention would be that this is fine).

I'm not sure why you'd need to "prevent calling destructors and moves", but overwriting objects with zeroes does not prevent destructors from running. delete[] values will call the destructurs of the array members. And moving the elements of an array of trivially-copyable type should compile down to just copying over the bytes anyways.

Furthermore, I would suggest to get rid of these String and EmbedString classes and simply use std::string. At least, it would seem to me that the sole purpose of EmbedString is to manually perform small string optimization. Any std::string implementation worth its salt is already going to do exactly that under the hood. Note that std::string is not guaranteed (and will often not be) trivially-copyable. Thus, you cannot simply replace String and EmbedString with std::string while keeping the rest of this current implementation.

If you can use C++17, I would suggest to simply use std::variant instead of or at least inside this custom JsonValue implementation as that seems to be exactly what it's trying to do. If you need some common information stored in front of whatever the variant value may be, just have a suitable member holding that information in front of the member that holds the variant value rather than relying on every member of the union starting with the same couple of members (which would only be well-defined if all union members are standard-layout types that keep this information in their common initial sequence [class.mem]/23).

The sole purpose of Array would seem to be to serve as a vector that zeroes memory before deallocating it for security reasons. If this is the case, I would suggest to just use an std::vector with an allocator that zeros memory before deallocating instead. For example:

template <typename T>
struct ZeroingAllocator
{
    using value_type = T;

    T* allocate(std::size_t N)
    {
        return reinterpret_cast<T*>(new unsigned char[N * sizeof(T)]);
    }

    void deallocate(T* buffer, std::size_t N) noexcept
    {
        auto ptr = reinterpret_cast<volatile unsigned char*>(buffer);
        std::fill(ptr, ptr + N, 0);
        delete[] reinterpret_cast<unsigned char*>(buffer);
    }
};

template <typename A, typename B>
bool operator ==(const ZeroingAllocator<A>&, const ZeroingAllocator<B>&) noexcept { return true; }

template <typename A, typename B>
bool operator !=(const ZeroingAllocator<A>&, const ZeroingAllocator<B>&) noexcept { return false; }

and then

using Array = std::vector<JsonValue, ZeroingAllocator<JsonValue>>;

Note: I fill the memory via volatile unsigned char* to prevent the compiler from optimizing away the zeroing. If you need to support overaligned types, you can replace the new[] and delete[] with direct calls to ::operator new and ::operator delete (doing this will prevent the compiler from optimizing away allocations). Pre C++17, you will have to allocate a sufficiently large buffer and then manually align the pointer, e.g., using std::align

Michael Kenzel
  • 15,508
  • 2
  • 30
  • 39
  • I don't understand why it should leak memory. Of course `delete values[];` does not free the memory, but the contained pointers were copied before, so the new object can delete them later. Furthermore, are you sure that this does not go into UB regarding lifetime? According to the standard, the lifetime of the objects pointed to by `values` ends since their storage is released. But the lifetime of their copies does never begin, since there is no proper initilization of them, if I see this correctly. Or can `memcpy` perform initialization? – n314159 Dec 03 '19 at 14:14
  • @n314159 Thanks for pointing that out, I did indeed somehow miss the fact that the pointers were copied into the new array before deleting the old array. I removed that bit from my answer. Apart from that, I'm pretty sure there are no lifetime issues here. `memcpy` does not create objects. But the lifetime of the elements of the new array was already begun by `new JsonValue[newCapacity]` and the `memcpy` does not end it… – Michael Kenzel Dec 03 '19 at 14:46
  • I definitely think that *"this is a very 'C'-ish way of doing things."* C++ has plenty of fine container-classes which can easily accommodate a varying number of anythings. Is it possible to use one of those instead? – Mike Robinson Dec 03 '19 at 14:47
  • @MikeRobinson of course one could use standard library containers instead. And I would definitely recommend doing so as per my answer above… – Michael Kenzel Dec 03 '19 at 14:49
  • @MichaelKenzel Yes, I saw that too, when I was halfway through writing an answer claiming UB regarding lifetimes @`new`^^. Do you think it would be a good idea to write the `move`-constructor of `JsonValue` in such a way that it memcpys and zeros-out the moved-from type? Then one could do all of this just as a normal move. Of course that would imply that one would always zeros the `JsonValue` but if that is indeed due to a security concern, that seems appropriate, anyway. – n314159 Dec 03 '19 at 14:55
  • @MichaelKenzel, security is a small concern, but not too much. I think the zero-ing was just to prevent the destructor from freeing memory it shouldn't, but is easier than setting `type = Type::Null` for each object. I believe the reason why the two strings were used instead of `std::string` is for memory reasons. `sizeof( std::string )` is 32, where as `sizeof( JsonValue )` is 16. Prior to this implementation, it was just using regular `nlohmann::json` objects, which I believe do use `std::string`, and this was using 20GB+ of memory to process some of the files. – ChrisMM Dec 03 '19 at 15:07
  • Getting rid of `String` without getting rid of the `memcpy()` is a really, really bad idea: `std::string` typically implements the small-string-optimization that puts the string's data within the `std::string` object itself **and points the internal storage pointer to that internal buffer**. When you move such an `std::string` with `memcpy()`, the pointer of the copy will still point into the original's object, leading to all kinds of undefined behavior. You can only move an object containing pointers with `memcpy()` if they do not target the memory region that you move. – cmaster - reinstate monica Dec 03 '19 at 15:11
  • It should be noted that `std::is_trivially_copyable::value` returns false, since there are move assignment and move constructors, which just swap the unions. – ChrisMM Dec 03 '19 at 15:14
  • @ChrisMM if the zeroing is just about not calling destructors, why don't you simply use an `std::vector`? I don't believe that moving the objects of it independently is much slower than `memcpy` the whole at once and then also zeroing the whole segment. – n314159 Dec 03 '19 at 15:31
  • @MichaelKenzel You say that it is defined if the Type is trivially copyable. Since it is not, do you know if the reverse holds? Is it then definitely UB? – n314159 Dec 03 '19 at 15:33
  • @n314159, I didn't write the code, just inherited it, so I really don't know. I think the idea was for performance, since vectors reallocation would presumably use move. I don't know for sure if `memcpy` would be faster or not, but I imagine it would be, at least a little bit. Also, I think for size, since `vector` is 24 – ChrisMM Dec 03 '19 at 15:34
  • @ChrisMM how old is the code? Maybe it predates move semantics. I think it would be worth to profile that. Maybe one could also use memcpy in the move constructor of `JsonValue` if it is indeed faster. That would still be cleaner in my opinion. – n314159 Dec 03 '19 at 15:42
  • @n314159, less than a year old, and the code has always been with C++17 on this branch. – ChrisMM Dec 03 '19 at 15:45
  • @cmaster I thought my answer made it clear that all this only works if `JsonValue` is trivially-copyable. But you're right to point that out. I added a sentence noting that `std::string` is generally not trivially-copyable and, thus, the whole `memcpy` dance won't work if you use that as your string type… – Michael Kenzel Dec 03 '19 at 15:59
  • @n314159 that is a very good question. I have been wondering about that myself. I can't really find anything in the standard that explicitly marks `memcpy` on a non trivially-copyable type as UB (I'm not convinced by the argument that doing so ends the object's lifetime as I fail to find anything that would say that in the standard as well). But the standard does not specify what the behavior of `memcpy` on a non trivially-copyable object would be. So whatever it is, it's at best neither useful nor reliable… – Michael Kenzel Dec 03 '19 at 16:05
  • @ChrisMM If `std::is_trivially_copyable_v` is false, then this code will not have well-defined behavior. As noted in the comment above, it does not seem that the standard explicitly marks this as UB. But the standard does not say anything about what the behavior would be either, which is about as good as UB if you ask me… – Michael Kenzel Dec 03 '19 at 16:08
  • @MichaelKenzel Found [this](https://stackoverflow.com/questions/29777492/why-would-the-behavior-of-stdmemcpy-be-undefined-for-objects-that-are-not-triv) regarding the definedness of memcpy on not trivial constructable objects (via cppref to memcpy) – n314159 Dec 03 '19 at 16:17
  • @ChrisMM I also think that the conclusion of the accepted answer fits nicely to this circumstances. Also, do you really think that implementing the move constructor and move assignment is necessary? From your description it would seem like they do what `= default` does too. – n314159 Dec 03 '19 at 16:25
  • @n314159 I also stumbled upon this already. But none of the answers there provide a quote from the standard to back up their claims. I cannot find anything beyond what [\[basic.life\]/1.4](https://timsong-cpp.github.io/cppwp/n4659/basic.life#1.4) says. And I don't see how memcpy would cause the storage to be reused (my understanding would be that reusing storage requires a new object to be created within said storage, which `memcpy` doesn't do). I especially cannot find anything that would limit "storage being reused" to the case of an object of non trivially-copyable type… – Michael Kenzel Dec 03 '19 at 16:26
  • @n314159 Well, I don't depend on the effects of the destructor. The corresponding cppreference article doesn't have that line anymore though, it just says "may be undefined" now. So from what I'm understanding from both you and MichaelKenzel, is that the behaviour isn't defined, and is _potentially_ UB. – ChrisMM Dec 03 '19 at 16:26
  • @n314159, just saw the edit to your comment. I don't think `= default` would work, since that wouldn't change the values in the original. At the very least, `detail.type` would need to be set to `Type::Null` to prevent the destructor from doing anything. – ChrisMM Dec 03 '19 at 16:59
  • @n314159 I see why zeroing in the move-constructor might seem like a good idea. However, I'd say that you can then still copy and have no guarantee that the original will ever be zeroed. The destructor would seem a better place to do zeroing of the memory occupied by `this`. However, if we take a vector of such self-zeroing objects, I very much doubt that compilers will be able to coalesc the volatile zeroing of the individual element destructors into zeroing one large block. Thus, you'll end up calling lots of small memsets in a loop… – Michael Kenzel Dec 03 '19 at 17:33
  • …And purely on a conceptual level, I'd argue that zeroing of memory should more be a concern of where and how objects are stored than it should be an added responsibility of a particular type… – Michael Kenzel Dec 03 '19 at 17:34
  • Yes, I see that defaulting will not work. I mixed up the end-condition for the the moved-from object. The state after moving is unspecified and that means it can still be the same. I should have thought of this just because my idea violates the rule of 5.^^ – n314159 Dec 03 '19 at 18:33