11

I want to set the padding bytes of a class to 0, since I am saving/loading/comparing/hashing instances at a byte level, and garbage-initialised padding introduces non-determinism in each of those operations.

I know that this will achieve what I want (for trivially copyable types):

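```cpp
#include <cstring>

struct Example
{
    Example(char a_, int b_)
    {
        // Zero the whole object first, padding included, then assign the members.
        std::memset(this, 0, sizeof(*this));
        a = a_;
        b = b_;
    }
    char a;
    int b;
};
```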

I don't like doing that though, for two reasons: I like constructor initialiser lists, and I know that setting the bits to 0 isn't always the same as zero-initialisation (e.g. pointers and floats don't necessarily have zero values that are all 0 bits).

As an aside, it's obviously limited to types that are trivially copyable, but that's not an issue for me since the operations I listed above (loading/saving/comparing/hashing at a byte level) require trivially copyable types anyway.

What I would like is something like this [magical] snippet:

```cpp
struct Example
{
    Example(char a_, int b_) : a(a_), b(b_)
    {
        // Leaves all members alone, and sets all padding bytes to 0.
        memset_only_padding_bytes(this, 0);
    }
    char a;
    int b;
};
```

I doubt such a thing is possible, so if anyone can suggest a non-ugly alternative... I'm all ears :)

Ben Hymers
  • 1
    My suggestion: 1. just do the `memset()`, then initialize explicitly (essentially your first approach), 2. exclude the padding bytes from the hash altogether. –  Oct 23 '13 at 15:12
  • 2
    "I don't like doing that though, for two reasons ..." Note that the second reason is irrelevant, because the bytes allocated to members will be written over by the constructor. – Sergey Kalinichenko Oct 23 '13 at 15:13
  • 1
    My suggestion would be instead of using `byte pad;` create a padding type that can auto zero itself (e.g. a templated type, so you can control the size `PadData pad;`) – benjymous Oct 23 '13 at 15:14
  • 2
    Note that the root cause of your problem is your desire to hash at the byte level. At this point, you are fighting the consequence of this poor decision. – Sergey Kalinichenko Oct 23 '13 at 15:16
  • 3
    It's only a poor decision if I don't get a good answer to this question ;) – Ben Hymers Oct 23 '13 at 15:20
  • randomly thinking out of my ass: have static members showing all the pad offsets and their sizes (list of pairs?). Instantiate one garbage object (to eliminate compiler padding differences), take the pointer math of the member locations vs the class pointer/sizeof, and voila - you have your padding locations/sizes at the other locations -> put them in the static members. On any subsequent constructor call, call a static function that zeros those. ... ? maybe? – im so confused Oct 23 '13 at 15:21
  • thinking more about it, i'm liking my solution ... as much as one could like a solution for an unnecessary problem haha – im so confused Oct 23 '13 at 15:24
  • You're pretty much in a lost position: http://stackoverflow.com/questions/14691237/function-that-calculate-size-of-structure-without-padding-bytes – Agent_L Oct 23 '13 at 15:34
  • @Agent_L if he can rewrite the constructor, he knows the members, which means the accepted answer there does not apply – im so confused Oct 23 '13 at 15:38
  • I tend to use this little snippet for the purpose: `template class PCPadding;` `template <> class PCPadding<1> { public: pcu8 m; PCPadding<1>():m(0u){}};` `template <> class PCPadding<2> { public: pcu16 m; PCPadding<2>():m(0u){}};` `template <> class PCPadding<4> { public: pcu32 m; PCPadding<4>():m(0u){}};` `template <> class PCPadding<8> { public: pcu64 m; PCPadding<8>():m(0u){}};` `#define PC_PADDING(a) PCPadding MPI_TOKPASTE(_PADDING,__LINE__)` – Dimo Markov May 17 '16 at 14:46

3 Answers

8

There's no way I know of to do this fully automatically in pure C++. We use a custom code generation system to accomplish this (among other things). You could potentially accomplish this with a macro to which you feed all your member variable names; it would simply look for holes between `offsetof(memberA) + sizeof(memberA)` and `offsetof(memberB)`.
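For what it's worth, the macro idea could be sketched roughly like this. `MemberSpan`, `MEMBER_SPAN` and `zero_padding` are made-up names for illustration, not an existing API. The sketch assumes a standard-layout type, members listed in declaration order, and no nested structs with padding of their own.

```cpp
#include <cstddef>
#include <cstring>
#include <initializer_list>

// Hypothetical helper type: where one member lives inside its struct.
struct MemberSpan {
    std::size_t offset;
    std::size_t size;
};

// Hypothetical macro: builds the span of `member` inside `type`.
#define MEMBER_SPAN(type, member) \
    MemberSpan{offsetof(type, member), sizeof(type::member)}

// Zeroes every byte of the object that is not covered by a listed member.
// Assumes `members` is given in ascending offset order.
inline void zero_padding(void* object, std::size_t object_size,
                         std::initializer_list<MemberSpan> members) {
    unsigned char* bytes = static_cast<unsigned char*>(object);
    std::size_t cursor = 0;  // first byte not yet accounted for
    for (const MemberSpan& m : members) {
        if (m.offset > cursor) {
            // Hole between the previous member and this one.
            std::memset(bytes + cursor, 0, m.offset - cursor);
        }
        cursor = m.offset + m.size;
    }
    if (object_size > cursor) {
        // Trailing padding after the last member.
        std::memset(bytes + cursor, 0, object_size - cursor);
    }
}

struct Example {
    Example(char a_, int b_) : a(a_), b(b_) {
        zero_padding(this, sizeof(*this),
                     {MEMBER_SPAN(Example, a), MEMBER_SPAN(Example, b)});
    }
    char a;
    int b;
};
```

One caveat: as far as I know the standard doesn't promise that padding bytes keep their values when the object is later copied or assigned member by member, so the zeroing would have to be repeated (or the object copied with `memcpy`) before any byte-level operation.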

Alternatively, serialize/hash on a memberwise basis, rather than as a binary blob. That's ten kinds of cleaner.
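A minimal sketch of the memberwise route, for comparison. `hash_combine`, `hash_example` and `equal_example` are illustrative names, and the mixing step is just the well-known `boost::hash_combine`-style formula written out by hand, so treat it as one possible choice rather than a recommendation.

```cpp
#include <cstddef>
#include <functional>

struct Example {
    char a;
    int b;
};

// Folds the hash of one value into a running seed.
template <typename T>
void hash_combine(std::size_t& seed, const T& value) {
    seed ^= std::hash<T>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hashes only the members, so padding bytes never take part.
inline std::size_t hash_example(const Example& e) {
    std::size_t seed = 0;
    hash_combine(seed, e.a);
    hash_combine(seed, e.b);
    return seed;
}

// Compares only the members, for the same reason.
inline bool equal_example(const Example& lhs, const Example& rhs) {
    return lhs.a == rhs.a && lhs.b == rhs.b;
}
```

The cost is exactly the one discussed in the comments: every member has to be named once per operation.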

Oh, one other option: you could provide an `operator new` which explicitly clears the memory before returning it. I'm not a fan of that approach, though: it doesn't work for stack allocation.

Sneftel
  • 2
    "serialize/hash on a memberwise basis" - I absolutely know this is the right thing to do, I just don't want to type out all those member names for each class, for each operation... the irony though is that I've already typed far more than that just by asking this question :) – Ben Hymers Oct 23 '13 at 15:36
  • Side note: I really wish C++ had reflection. – Ben Hymers Oct 23 '13 at 15:37
  • @BenHymers Have a look at [Boost.Serialization](http://www.boost.org/doc/libs/1_54_0/libs/serialization/doc/index.html) - that would take out a lot of the work of performing your own serialization and/or memberwise hashing. – JBentley Oct 23 '13 at 15:39
  • Awwww, you're right, I should just bite the bullet and do everything member-wise. Clang-extract looks excellent, too! JBentley, Boost.Serialization is great (I use it on other projects) but I can't use it for this project - good suggestion otherwise though. – Ben Hymers Oct 23 '13 at 16:15
3

You should never use padded structs when writing/reading them in binary, simply because the padding can vary from one platform to another, which leads to binary incompatibility.

Use some compiler options, like `#pragma pack(push, 1)`, to disable padding when defining those writable structs, and restore it with `#pragma pack(pop)`.
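A small sketch of what that looks like, assuming a compiler that accepts `#pragma pack` (MSVC, GCC and Clang do). `PackedExample` is just an illustrative name.

```cpp
#include <cstddef>

#pragma pack(push, 1)   // from here on, lay members out with 1-byte alignment
struct PackedExample {
    char a;
    int b;              // starts right after `a`, no hole in between
};
#pragma pack(pop)       // restore the previous packing setting

// With packing in effect there should be no padding bytes at all.
static_assert(sizeof(PackedExample) == sizeof(char) + sizeof(int),
              "PackedExample is expected to contain no padding");
static_assert(offsetof(PackedExample, b) == sizeof(char),
              "b is expected to follow a directly");
```

Keep in mind that `b` is now misaligned, so taking a pointer or reference to it can be a problem on architectures that don't tolerate unaligned access.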

This sadly means you're losing the optimization that padding provides. If that is a concern, you can carefully design your structs and manually "pad" them by inserting dummy variables. Zero-initialization then becomes obvious: you just assign zeros to those dummies. I don't recommend that "manual" approach as it's very error-prone, but since you're writing binary blobs you're probably dealing with that concern already. By all means, benchmark unpadded structs first.
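A sketch of the manual approach, assuming a platform where `std::int32_t` has 4-byte alignment (the `static_assert`s are there to catch the cases where that assumption is wrong). `pad_` and `bytes_equal` are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Example {
    // The filler is a real member, so it goes in the initialiser list
    // like everything else; `pad_{}` sets all three bytes to zero.
    Example(char a_, std::int32_t b_) : a(a_), pad_{}, b(b_) {}

    char a;
    unsigned char pad_[3];  // explicit filler where the compiler would pad
    std::int32_t b;
};

// Fail loudly if the compiler still inserts padding of its own.
static_assert(offsetof(Example, b) == 4, "unexpected hole before b");
static_assert(sizeof(Example) == 8, "unexpected padding in Example");

// Every byte now belongs to a member, so a byte-level compare is well defined.
inline bool bytes_equal(const Example& lhs, const Example& rhs) {
    return std::memcmp(&lhs, &rhs, sizeof(Example)) == 0;
}
```

The error-prone part is keeping the filler sizes in sync when members are added or reordered, which is why the `static_assert`s are worth having.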

Agent_L
  • 1
    I've tried using pragmas and similar options for other compilers to share serialised data between programs on different platforms, but found them to be unreliable and ended up doing member-wise serialisation instead. If it did work the same way for all the platforms I care about, then this would definitely be the way to go (in my case). – Ben Hymers Oct 23 '13 at 16:10
  • It's strange what you're saying, because packing options should be reliable, they are essential to binary write. Maybe you should ask another question regarding packing commands in compilers you're using. – Agent_L Oct 23 '13 at 16:21
  • @BenHymers Here they are saying that `#pragma pack` is directly supported on many mainstream compilers: http://stackoverflow.com/questions/13927273/how-to-translate-struct-packing-from-vc-to-gcc Of course, there is then another question: whether every one of your target architectures supports unaligned variables. That problem may arise even on x86 when using SSE. – Agent_L Oct 23 '13 at 16:37
3

I faced a similar problem - and simply saying that this is a poor design decision (as per dasblinkenlight's comment) doesn't necessarily help as you may have no control over the hashing code (in my case I was using an external library).

One solution is to write a custom iterator for your class, which iterates through the bytes of the data and skips the padding. You then modify your hashing algorithm to use your custom iterator instead of a pointer. One simple way to do this is to templatize the pointer so that it can take an iterator - since the semantics of a pointer and an iterator are the same, you shouldn't have to modify any code beyond the templatizing.

EDIT: Boost provides a nice library which makes it simple to add custom iterators to your container: Boost.Iterator.
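To make the iterator idea concrete, here is a hand-rolled sketch without Boost. `ExampleByteIterator` is a hypothetical name, and FNV-1a merely stands in for whatever hashing algorithm the external library provides; the point is only that the hash function is templated on the iterator type, so it accepts either a raw byte pointer or the padding-skipping iterator.

```cpp
#include <cstddef>
#include <cstdint>

struct Example {
    char a;
    int b;
};

// Walks over the bytes of an Example that belong to members, skipping padding.
class ExampleByteIterator {
public:
    // Number of non-padding bytes in an Example.
    static constexpr std::size_t count = sizeof(char) + sizeof(int);

    ExampleByteIterator(const Example& e, std::size_t index)
        : base_(reinterpret_cast<const unsigned char*>(&e)), index_(index) {}

    unsigned char operator*() const { return base_[offset_of(index_)]; }
    ExampleByteIterator& operator++() { ++index_; return *this; }
    bool operator!=(const ExampleByteIterator& other) const {
        return index_ != other.index_;
    }

private:
    // Maps the i-th payload byte to its real offset inside the struct.
    static std::size_t offset_of(std::size_t i) {
        return i < sizeof(char) ? offsetof(Example, a) + i
                                : offsetof(Example, b) + (i - sizeof(char));
    }

    const unsigned char* base_;
    std::size_t index_;
};

// 64-bit FNV-1a over any range of bytes; works with pointers and iterators alike.
template <typename ByteIt>
std::uint64_t fnv1a(ByteIt first, ByteIt last) {
    std::uint64_t hash = 0xcbf29ce484222325ull;  // FNV offset basis
    for (; first != last; ++first) {
        hash ^= static_cast<unsigned char>(*first);
        hash *= 0x100000001b3ull;                // FNV prime
    }
    return hash;
}

inline std::uint64_t hash_example(const Example& e) {
    return fnv1a(ExampleByteIterator(e, 0),
                 ExampleByteIterator(e, ExampleByteIterator::count));
}
```

The offset table still has to be written per class, so this doesn't remove the need to list the members; it only keeps the hashing code itself unchanged.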

Whichever solution you go for, it is highly preferable to avoid hashing the padding as doing so means that your hashing algorithm is highly coupled with your data structure. If you switch data structures (or as Agent_L mentions, use the same data structure on a different platform which pads differently), then it will produce different hashes. On the other hand, if you only hash the actual data itself, then you will always produce the same hash values no matter what data structure you use later.

JBentley