Memory Efficiency Regarding The Use of Unions in C++

Question

Consider the three code blocks below, three alternative solutions to the string/union conflict.

Of these options, in general, which is more memory efficient regarding the use of unions in this fashion?

I'm looking for answers addressing principle here: that is, the primary purpose of unions is to save memory.

EDIT: A class assignment forced me to use unions in this way. It had me thinking which was the most efficient, and that's how I got here.

Code Block (A):

// unions with pointers to structs
struct HourlyInfo {
    string firstName;
    string lastName;
    string title;
    int hoursWorked;
    double hourlyRate;
};

struct SalaryInfo {
    string firstName;
    string lastName;
    string title;
    double salary;
    double bonus;
};

struct Employee {
    bool isHourly;
    union {
        HourlyInfo *hourlyEmployee;
        SalaryInfo *salaryEmployee;
    }

};

Code Block (B):

// applying unions to relevant and non-string data types
struct Employee {
    string firstName;
    string lastName;
    string title;
    bool isHourly;
    union {
        struct {
            double hourlyRate;
            int hoursWorked;
        } hourly;
        struct {
            double salary;
            double bonus;
        } salaried;
    };
};

Code Block (C):

// use cstring instead of string
struct HourlyInfo {
    cstring firstName[50];
    cstring lastName[50];
    string title[50];
    int hoursWorked;
    double hourlyRate;
};

struct SalaryInfo {
    cstring firstName[50];
    cstring lastName[50];
    cstring title[50];
    double salary;
    double bonus;
};

struct Employee {
    bool isHourly;
    union {
        HourlyInfo hourlyEmployee;
        SalaryInfo salaryEmployee;
    }
};

(Note: The idea behind the code is that any employee is either hourly or salary, thus the union here. Please don't suggest alternative solutions to this problem that do not involve unions. I'm not worried about solving a specific problem, I'm interested in unions.)

Also, pointers and other data types seem to vary greatly in their sizes:

What does the C++ standard state the size of int, long type to be?

How much memory does a C++ pointer use?

Does this mean there is no blanket statement we can make about memory efficiency here? If so, which factors should we consider in determining most-efficient methods?

You're not really making a fair comparison, because A and C also need a bool to identify which member of the union is active. — Steve Jessop, Oct 26 '13 at 20:04
Have you gone through and figured out the memory sizes before the union? Do you know how std::string works? Why do you have 50 of them? What is a cstring? `const char*`? — Yann Ramin, Oct 26 '13 at 20:05
I don't think you can have unnamed structures within unions in c++ — BЈовић, Oct 26 '13 at 20:09
What is your goal? why not `class Employee {...}; class HourlyEmployee : public Employee {...}; class SalaryEmployee : public Employee {...}` etc? — Charlie, Oct 26 '13 at 20:10
@SteveJessop it's a consequence of using those methods, not necessarily unfairness. Is there a variation of these methods I could use to make it more "fair?" — skippr, Oct 26 '13 at 20:10
"Please don't suggest alternative solutions to this problem that do not involve unions" — unions are a problem, not a solution. — n. m. could be an AI, Oct 26 '13 at 20:12
@sunday: what I mean is: if you have an instance of `Employee` under A or C then you don't know which union member is active. So they're useless, they don't do the job. Add a means to discriminate and you can make a comparison with B, but you cannot make an efficiency comparison between code that works and code that doesn't work. — Steve Jessop, Oct 26 '13 at 20:12
@SteveJessop that was a major oversight on my part. They should all have the isHourly boolean. Noted, and making the necessary changes. — skippr, Oct 26 '13 at 20:16
@n.m. Haha, yes, there are better ways to accomplish this goal. I had a class assignment that forced me into this situation, and it just got me thinking about the implications of different methods. — skippr, Oct 26 '13 at 20:18
In (C) it is not clear what is `cstring`, why you need arrays of them, and why you are mixing them with `string`s. — n. m. could be an AI, Oct 26 '13 at 20:23
Sure looks like a severe case of the [NIH](http://en.wikipedia.org/wiki/Not_invented_here) syndrome. If it's for study purposes, no problems with that (although I'd try to make that more clear from the question text) — sehe, Oct 26 '13 at 20:45
@sehe good point. I'll try to make that clearer in the question. It is indeed out of mere curiosity. — skippr, Oct 26 '13 at 20:47
What problem are you having calculating the sizes for yourself? — bames53, Oct 26 '13 at 20:59
@bames53 Sizes vary across platforms. I'm more worried about principle here than anything else, and if there's a general evaluation we can make about the use of unions. — skippr, Oct 26 '13 at 21:02
@sunday Well the answer as to which method is more space efficient varies across platforms too, as well as across usage scenarios. You need to compute the sizes for every platform and usage you care about. Also, if you do the exercise of computing the sizes for various circumstances then the principles should become clear. — bames53, Oct 26 '13 at 21:25
@bames53 even with variations, there must be a general range of values we can consider. If we know a range, we can throw up scenarios that consider how string overhead, for example, might affect our reasoning. See Steve Jessop's answer regarding overhead. Perhaps you'd like to expand on that? — skippr, Oct 26 '13 at 21:33
@sunday so are you saying you need help figuring out that general range of values, and that's the problem you ran into when trying to calculate this yourself? — bames53, Oct 26 '13 at 21:59
@bames53: there's more to it than figuring the range of values. For example I think A will use more memory than B on pretty much any C++ implementation I can contemplate without losing D6 SAN. So you don't necessarily need to compute the range of sizes in order to draw at least some reasonable conclusions. To me this makes the question non-stupid. — Steve Jessop, Oct 26 '13 at 22:02
@SteveJessop If double were 1000 bytes, and ints and pointers were 4 bytes then version A would use less memory unless the number of hourly employees were tiny relative to the number of salaried employees. (I have no idea what D6 SAN means but presumably 1000 byte doubles 'lose' it.) I'm not saying the question is dumb, just that it's clear that little effort was put into finding the answer before coming to stackoverflow. — bames53, Oct 26 '13 at 23:26

sehe · Answer 1 · 2013-10-26T20:30:40.417

3

Rule #1: follow your profiler (it will tell you which is more effective for your program)

Rule #2: regarding memory allocations: employ custom allocators to hide the complexity for you

Rule #3: design your datatypes for clear expression of intent/purpose (in that sense, only B is an option). Of course, unless Rule #1 necessitates another take (that's pretty unusual)

I know I'm "not allowed" to propose alternatives: (Live on Coliru)

#include <string>
#include <boost/variant.hpp>

struct HourlyInfo {
    int    hoursWorked;
    double hourlyRate;
};

struct SalaryInfo {
    double salary;
    double bonus;
};

namespace detail {

    template <template <typename...> class Allocator = std::allocator>
    struct basic_employee {
        using string = std::basic_string<char, std::char_traits<char>, Allocator<char>>;
        string firstName;
        string lastName;
        string title;

        using RateInfo = boost::variant<HourlyInfo, SalaryInfo>;
        RateInfo rates;
    };
}

using Employee = detail::basic_employee<>; // or use a custom (pool?) allocator

int main()
{
    Employee staff1 = { 
        "John", "Cage", "From accounting", 
        SalaryInfo { 1900.00, 120.0 } 
    };
    Employee contractor = { 
        "Joe", "Duffy", "Plumbing jobs", 
        HourlyInfo { 3, 46.00 } 
    };
}

edited Oct 26 '13 at 20:30

answered Oct 26 '13 at 20:23

sehe

374,641
47
450
633

1

My cat is allergic to Boost, so I can't use it here ;-) – skippr Oct 26 '13 at 20:45
5

@sunday Then get therapy^H^Ha new cat. Boost is a must for C++., – Konrad Rudolph Oct 26 '13 at 20:46
@sunday Cheers for editing it back into the more entertaining form :/ – sehe Oct 26 '13 at 20:48
@sehe agreed. Helps other people realize I'm joking :) BTW, I'm updating the question. This whole thing arose because a class assignment forced me to use unions and I was curious as to which was the better method. – skippr Oct 26 '13 at 20:50
@sunday Yeah, I figured, hence my [comment at the OP](http://stackoverflow.com/questions/19611235/memory-efficiency-regarding-the-use-of-unions-in-c#comment29111939_19611235) – sehe Oct 26 '13 at 20:51
@KonradRudolph http://stackoverflow.com/questions/1226206/is-there-a-reason-to-not-use-boost – skippr Oct 26 '13 at 20:51
@KonradRudolph +1 for the change :-) – skippr Oct 26 '13 at 20:54
@sunday Yeah, your explanation makes the whole thing way more reasonable. It also makes the question rather more intriguing in my view. – Konrad Rudolph Oct 26 '13 at 20:54
@KonradRudolph Can you remove the Boost content and maybe expand on your rules? I'm not interested in Boost for this specific question though I'm still interested in what you have to say. Please read the very last question in my post: "Does this mean there is no blanket statement..." ? – skippr Oct 26 '13 at 20:57
@sunday My answer is also not about boost: I'm just skimping over the tedious handy-work to get discriminated unions "done right". That's not the essence of your question, anyway. I'm just vouching for a clean way to design the code, and showing how to allow for great allocation control using the standard library's standard allocator options. You can certainly find more information on that on SO. ***Iff*** your profiler tells you so :) – sehe Oct 26 '13 at 21:06
Oops, sorry @KonradRudolph, my previous comment was meant for sehe. – skippr Oct 26 '13 at 21:08
@sehe so if you had to pick one of my three options, which would you choose and why? – skippr Oct 26 '13 at 21:35
@sunday That's in my answer: option B, because it's the only one (IMO) that best expresses intent of implementation. – sehe Oct 26 '13 at 22:02

Steve Jessop · Accepted Answer · 2013-10-26T22:06:13.970

B will probably use least memory, but 50 was a good number to pick because it leaves it in doubt.

With A, presumably you're going to make a separate memory allocation of one of the two possibilities. In terms of memory use this will pretty much always introduce some inefficiency, plus there's the space for the pointer itself, so it loses to B. It could potentially win, though, in the peculiar case where the two info structs differ in size by more than the size of a pointer, more than a certain proportion use the smaller struct, and you allocate them from a ferociously low-overhead memory allocator such as a pool allocator.

With C I assume you mean an array of 50 char, not an array of 50 string or cstring. I believe that the average length of a name, plus the overhead of string, is less than 50 characters, and that's the basis on which I say that B beats C. However, you are correct that the overhead of string depends on some implementation details, so I can't say this categorically. Furthermore, if you're dealing with people whose names are all slightly under 50 characters then C would win. I just think that's unlikely.

Of course, C is more limited in that it cannot deal at all with someone who has a name over 50 characters (49 if you store the strings nul-terminated).

[Edit: thinking about it again, the overhead of a string could be:

8 bytes for a start pointer
8 bytes for an end pointer
8 bytes for capacity
two more pointers (16 bytes) for the header of the allocation that contains the string data, with the allocation itself rounded up to 8 or 16 bytes.

Total 48 or 56 bytes for a short string (although there's a thing called "short string optimization" that makes short strings better, although depending on details it could make long strings worse). With that implementation of string and memory allocation, C would win, and even without the rounding up it might well win in Sri Lanka.

So, it is worth working out how to measure the actual memory use.]

Remember that the name char arrays are separated into first and last. I would think most names WILL BE under 50 characters. Does that go against "C would win. I just think that's unlikely." ? — skippr, Oct 26 '13 at 20:41
Also, you were correct on the `s/string/cstring` assumption in Code Block (C). — skippr, Oct 26 '13 at 20:42
@sunday: it is true that if the names are all slightly under 50 characters then C would win. It is also true that I think it's unlikely that the names are all slightly under 50 characters. If we take the value of "slightly" as the overhead of a string then we can say what matters is whether the average name length is greater or less than `50 - slightly`. This holds whether "name" means the full name or, as in this case, each of the three parts considered separately. If it were the full name then I'd be less confident. The overhead of a string could easily be 16+ bytes ... — Steve Jessop, Oct 26 '13 at 20:57
... and I have a pretty short full name but it's still 19 characters including two spaces. — Steve Jessop, Oct 26 '13 at 20:58
I misunderstood your use of slightly, but I'm following you. Thanks for the continued expansion in your answer. I'm not sure why people aren't upvoting yours as it's actually addressing the issue here. — skippr, Oct 26 '13 at 21:06
@sunday: maybe all the people not upvoting are on 64 bit machines where `string` is pretty huge ;-) — Steve Jessop, Oct 26 '13 at 21:08
your edit brings up a huge point; I hadn't thoroughly considered the effect of string overhead. — skippr, Oct 26 '13 at 21:25

Memory Efficiency Regarding The Use of Unions in C++

2 Answers2