1

I have a class foo that manages data using small buffer optimization (SBO).
When size < 16, the data is held locally (in buffer), otherwise it is stored on the heap, with reserved holding the allocated space.

class foo {
    static const int sbo_size = 16;

    long size = 0;
    char *ptr;

    union {
        char buffer[sbo_size];
        long reserved;
    };
public:

    foo()
    {
        for (int i = 0; i < sbo_size; ++i)
            buffer[i] = 0;
    }

    void clone(const foo &f)
    {
        // release 'ptr' if necessary

        if (f.size < sbo_size)
        {
            memcpy(this, &f, sizeof(foo));
            ptr = buffer;
        } else
        {
            // handle non-sbo case
        }
    }
};

Question about clone():
With the SBO case, it may not be clear for the compiler that union::buffer will be used.
is it correct to use memcpy and set ptr accordingly?

timrau
  • 22,578
  • 4
  • 51
  • 64
Tootsie
  • 655
  • 3
  • 11
  • If I understand the question correctly, your concern is that after the `memcpy` in the "sbo" case (the `if` branch) you are wondering if `f.buffer` would be guaranteed to be considered the "active" union member for `f`? – François Andrieux Mar 15 '22 at 15:52
  • @FrançoisAndrieux @hyde The concern is related to the UB you get when using a `union` for type punning – Tootsie Mar 15 '22 at 15:57
  • 1
    The code above is not a small buffer optimisation. SBO would be when `static const int sbo_size = sizeof(void*); long size = 0; union { char *ptr; char buffer[sbo_size]; };` is used. – 273K Mar 15 '22 at 15:57
  • @Tootsie It looks like `memcpy` is compatible with unions, it figuratively "copies" which member is active. See the duplicate for details. Edit : the question refers to copying over a specific union members, but the top answer talks about copying a whole union as well. – François Andrieux Mar 15 '22 at 15:58
  • @FrançoisAndrieux Reopened and posted an answer which doesn't use unions, – Paul Sanders Mar 15 '22 at 17:39
  • @FrançoisAndrieux `memcpy` must work on all "PODs" or whatever you want to call these, but the std was never very specific about the fact unions must be supported, like, at all, in C++. It's the intent, and *intent is more important than written law* (in programming). – curiousguy Mar 24 '23 at 00:44

1 Answers1

2

If you can use C++17, I would side-step any potential type-punning problems by using std::variant in place of a union.

Although this uses a small amount of storage internally to keep track of the current type it contains, it's probably a win overall as your ptr variable can disappear (although that should be inside your union anyway).

It's also typesafe, which a union is not (because std::get will throw if the variant doesn't contain the desired type) and will keep track of the type of data it contains simply by assigning to it.

The resulting class fragment might look something like this (no doubt this code can be improved):

class foo
{
private:
    static const size_t sbo_size = 16;
    using small_buf = std::array <char, sbo_size>;
    size_t size = 0;
    std::variant <small_buf, char *> buf = { };

public:
    void clone (const foo &f)
    {
        char **bufptr = std::get_if <char *> (&buf);
        if (bufptr)
            delete [] *bufptr;

        size = f.size;
        if (size < sbo_size)
            buf = std::get <small_buf> (f.buf);
        else
        {
            buf = new char [size];
            std::memcpy (std::get <char *> (buf), std::get <char *> (f.buf), size);
        }
    }
};

Notes:

  • You will see that I've used std::array instead of a C-style array because std:array has lots of nice features that C-style arrays do not

  • Why clone and not a copy constructor?

  • if you want foo to have an empty state (after being default constructed, say), then you can look into the strangely named std::monostate.

  • For raw storage, std::byte is probably to be preferred over char.

Fully worked example here.


Edit: To answer the question as posed, I am no language lawyer but it seems to me that, inside clone, the compiler has no clue what the active member of f might be as it has, in effect, been parachuted in from outer space.

In such circumstances, I would expect compiler writers to play it safe and set the active member of the union to "don't know" until some concrete information comes along. But (and it's a big but), I wouldn't like to bet my shirt on that. It's a complex job and compiler writers do make mistakes.

So, in a spirit of sharing, here's a slightly modified version of your original code which fixes that. I've also moved ptr inside your union since it clearly belongs there:

class foo {
    static const int sbo_size = 16;

    long size = 0;

    union {
        std::array <char, sbo_size> buffer;   // changing this
        char *ptr;
        long reserved;
    };
public:

    foo()
    {
        for (int i = 0; i < sbo_size; ++i)
            buffer[i] = 0;
    }

    void clone(const foo &f)
    {
        // release 'ptr' if necessary

        if (f.size < sbo_size)
        {
            buffer = f.buffer;                // lets me do this
            ptr = buffer.data ();
        } else
        {
            // handle non-sbo case
        }
    }
};

So you can see, by using std::array for buffer (rather than one of those hacky C-style arrays), you can directly assign to it (rather than having to resort to memcpy) and the compiler will then make that the active member of your union and you should be safe.

In conclusion, the question is actually rather meaningless since one shouldn't (ever) need to write code like that. But no doubt someone will immediately come up with something that proves me wrong.

Paul Sanders
  • 24,133
  • 4
  • 26
  • 48
  • +1 for helpful advice, and a good alternative, but I think it missed answering the question being asked. Answer that question directly would help complete this answer. My understanding of the question is whether it is well defined to dereference `ptr` after `memcpy(this, &f, sizeof(foo)); ptr = buffer;` considering that it is UB to read from a member of a union that was not the last member written to. – François Andrieux Mar 15 '22 at 17:43
  • @FrançoisAndrieux I guess so, but I don't like `union`s very much as they are such slippery creatures, so I thought it worth offering a 'more modern' alternative. And thanks for the +1, most kind. – Paul Sanders Mar 15 '22 at 18:48
  • @Paul Sanders I appreciate the advice.. `variant` is certainly a better choice for higher level code. However, in low-level implementations (such as `std::string`), they almost certainly use a `union` to implement SBO. – Tootsie Mar 15 '22 at 19:18
  • @FrançoisAndrieux OK, I have had a stab at that and posted an alternative which just tweaks the original code to make it safe. – Paul Sanders Mar 15 '22 at 19:18
  • @Tootsie Why do you need a 'low level' solution? If that's what you need, the information should be in your question from the outset. Also, I have added an alternative solution which should work for you whilst still using a union. – Paul Sanders Mar 15 '22 at 19:20
  • @FrançoisAndrieux I think I didn't phrase the question accurately... It's indeed whether `ptr` can be dereferenced after `memcpy`. With normal use of a `union`, there is only one active member; ie. the one that was last written to. Another member can be active with the same union bit pattern; the programmer knows and the compiler knows. With `memcpy`, the bits are literally copied, but how does the compiler know which member is active. – Tootsie Mar 15 '22 at 19:20
  • For example: `union un_t { double d; long l; }; un_t x, y; x.l = 1234; memcpy(&y, &x, sizeof(un_t));` Is reading `y.d` now undefined behavior while `y.l` is not ? Perhaps the compiler can figure out here what the active member is, but what if the source union was created in another thread ? – Tootsie Mar 15 '22 at 19:20
  • Why do you want to use `ptr` at all when you know (from the value in `size`) that the data is in fact in `buffer`? I think that's a key point that you are missing here. – Paul Sanders Mar 15 '22 at 19:21
  • @Tootsie In both cases (with or without `memcpy`) the compiler can't know which member is active, at least not generally. It's simply UB to access a member that wasn't the last one written to. How, or if that is enforced is unspecified because it falls into Undefined Behavior. A compiler can make assumptions in some cases though, which is why it is UB. – François Andrieux Mar 15 '22 at 19:22
  • 2
    @PaulSanders Not OP, but one reason I can think of is that it avoids a branch for accessing the data. – François Andrieux Mar 15 '22 at 19:25
  • @FrançoisAndrieux OK, makes sense. That would explain why it's outside the union. In any case, my second proposal works equally well regardless of where `ptr` actually lives. It is safe OP, you should use it. – Paul Sanders Mar 15 '22 at 19:37