Practical way of implementing comparison of unions in c

Question

For some testing purposes, I need to compare two unions to see if they are identical. Is it enough to compare all members one by one?

union Data {
    int i;
    float f;
    char* str;
} Data;

bool isSame(union Data left, union Data right)
{
    return (left.i == right.i) && (left.f == right.f) && (left.str == right.str);
}

My hunch is that it could fail if one of the unions first contained a larger type and then switched to a smaller type. I have seen some suggestions to wrap the union in a struct (like here: What is the correct way to check equality between instances of a union?) which keeps track of which data type the union currently holds, but I don't see how that would be implemented in practice. Would I not need to manually set the union type in every place where I set the union value?

struct myData
{
    int dataType;
    union {
        ...
    } u;
};

void someFunc()
{
    struct myData my_data_value = {0};
    my_data_value.u.i = 5;
    my_data_value.dataType = ENUM_TYPE_INTEGER; // the tag lives in the struct, not in the union

    my_data_value.u.f = 5.34;
    my_data_value.dataType = ENUM_TYPE_FLOAT;
    
    ...
}

It does not seem warranted to double all code where my union is involved simply to be able to make a perfect comparison of the union values. Am I missing some obvious smart way to go about this?

What exactly are you trying to conclude with the comparison? Equality of certain elements? Then compare them. Equality of the occupied memory? Then use `memcmp` — Eugene Sh., Sep 17 '21 at 14:51
I think you'd be better off writing a function to compare `myData`s rather than `u`s. You can start with an early-out if the `dataType` is different. Then, you can switch on the `dataType` and compare the union member that's valid. I think that's probably the best balance between safety and performance. It's been a long time since I used unions, but... if you're not tracking which member is valid, you've potentially got bigger problems than being able to compare two of them. — Tim Randall, Sep 17 '21 at 14:55
@EugeneSh. If I push some general data through parts of my program, I know that it ends up in a union (e.g. union myData finalData = pushThrough("1234.09")), and I would like to write a test like: assert(isSame(1234.09, finalData) == true), but it would be nice if I could generalize it to take a union on the left side as well, so I don't have to make a comparison function for every type in my union (e.g. isSameFloat(...), isSameDouble()..., isSameInt()...). Does that make it clearer? — guybrush_threepwood, Sep 17 '21 at 15:08
@TimRandall Are you saying that it is mandatory to always wrap the union in a struct? I'm using the union to contain data that could be of different types, mainly as a way of simplifying/generalizing my code (so I don't need to make different functions for every type I use). I "know" what type the data is when I need to convert it so I can use the correct member then, but in-between it's just some "generic" data, symbolized by the union. I don't know if this is "correct", but I found it made my life simpler at that moment. — guybrush_threepwood, Sep 17 '21 at 15:17
In order to use a union properly (*), you need to know which element is the currently active one. It doesn't matter how you know that. You may keep the current element ID in a struct, or compute it on the fly from other data, or whatever, but you need to have this information somewhere, somehow. Otherwise a union is just a meaningless bunch of bits. (*) By "properly" I mean "without type punning". Type punning is technically allowed, but it does turn your union into a meaningless bunch of bits. If that's what you want, compare with `memcmp`. — n. m. could be an AI, Sep 17 '21 at 15:52

rici · Accepted Answer · 2021-09-17T17:27:17.167

If your proposal worked, then you could achieve the same effect without multiple comparisons by using memcmp(&left, &right, sizeof left). But that won't work and neither will your proposal, for the same reason.

First, assignment to a union member which does not occupy all the bytes allocated to the union has unspecified effect on the unoccupied bytes. The most likely is that they will not be modified from their previous values, but any value is possible. Comparing the values of such bytes has an unspecified result.

You might think that memsetting the bytes of the union to 0 before assigning a member would allow the comparison to work, but the standard does not require the unused bytes to be unmodified. Moreover, many compilers will optimise away the attempt to clear the union on the grounds that it has no legal effect if the next statement gives the union a new value.

There are other reasons why comparing union members which are not the current value goes wrong, and they apply even if neither value includes padding.

For example, if you don't know that the two union values currently have the same active member, you might get a false equivalence. (Every float has the same bit pattern as some int but the two values are certainly not the same.)

Less obviously, it's possible for two values with different bit patterns to actually be equal. (Floating point 0.0 and -0.0 are considered equal, for example.)

Finally, not every bit pattern is a valid float; if one or both of the union values is an int whose bit pattern corresponds to a floating NaN, trying to compare the values as floats will certainly produce the wrong answer (a NaN is not equal to itself) and may throw a floating point exception.
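The signed-zero and NaN cases above can be checked directly. The following is a minimal sketch, assuming IEEE 754 floats; the three helper names are hypothetical and do not appear in the question.

```c
#include <math.h>
#include <stdbool.h>
#include <string.h>

/* Value comparison: what the == operator sees. */
static bool float_value_equal(float a, float b)
{
    return a == b;
}

/* Representation comparison: what memcmp sees. */
static bool float_bits_equal(float a, float b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}

/* One possible test-friendly rule: equal values, or both NaN. */
static bool float_test_equal(float a, float b)
{
    return a == b || (isnan(a) && isnan(b));
}
```

Under that assumption, 0.0f and -0.0f are equal by value but differ by representation, while a NaN matches itself by representation but not by value. Which rule a test should use is a design decision, and it can only be applied once the active member is known to be the float.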

In short, if you don't know which type is active for a union, you cannot usefully use the union value, other than to assign it to another object of the same union type. That means that there must be some mechanism, internal or external, which identifies the active type of the union.

The choice between external mechanisms (used, for example, in yacc-generated parsers) and internal mechanisms (so-called "discriminated unions", as you suggest at the end of your question) will depend on the precise application environment.
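As an illustration of the internal mechanism, here is a minimal sketch of a discriminated union whose tag is set in exactly one place per type, so calling code never writes the tag by hand. The names tagged_data, data_type, make_int, make_float and make_string are hypothetical and chosen only for this sketch.

```c
/* Hypothetical tag values for this sketch. */
enum data_type { TYPE_INTEGER, TYPE_FLOAT, TYPE_STRING };

struct tagged_data {
    enum data_type type;   /* identifies the active union member */
    union {
        int i;
        float f;
        const char *str;
    } u;
};

/* Each constructor sets the value and the matching tag together. */
static struct tagged_data make_int(int v)
{
    return (struct tagged_data){ .type = TYPE_INTEGER, .u.i = v };
}

static struct tagged_data make_float(float v)
{
    return (struct tagged_data){ .type = TYPE_FLOAT, .u.f = v };
}

static struct tagged_data make_string(const char *s)
{
    return (struct tagged_data){ .type = TYPE_STRING, .u.str = s };
}
```

With helpers like these, a statement such as `struct tagged_data d = make_float(5.34f);` replaces the pair of assignments shown in the question, which keeps the value and the tag from drifting apart.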

`(Actually, there's an additional reason your proposal won't work: accessing a union member other than the last one used to assign the union is Undefined Behaviour.` in C++ not in C — 0___________, Sep 17 '21 at 16:47
@0___________: True, it's only UB if the bit pattern isn't a value of the type used to access it. So I should have said "may be Undefined Behaviour" but on reflection, it seemed easier to just remove the parenthetic observation, since it's not really relevant. — rici, Sep 17 '21 at 17:28

score 0 · Answer 2 · edited Sep 17 '21 at 17:24

What I would do is either compare the largest elements against each other (when holding a float or an int) or, when holding a pointer type (string), compare what the pointers point to, so we have something like this:

bool isSame(struct myData d1, struct myData d2)
{
    if(d1.dataType != d2.dataType)
        return 0; // invalid comparison
    if(d1.dataType == ENUM_TYPE_STRING)
        return !strcmp(d1.str, d2.str); // <--- compare the strings
    return d1.str == d2.str; // <--- compare either int/int or float/float
}

This will compare either the two strings or the two numbers, depending on the data type.

I said d1.str; I just put the union into the struct as unnamed union so you can just access the variables from the myData struct:

struct myData
{
    int dataType;
    union {
        int i;
        float f;
        char *str;
    };
};

Now, I must say that I have not completely answered your question here …

You must keep track of the data type. If you look at the isSame function above, we must not compare invalid data against each other, e.g. string pointer and float, that does not make sense. Even using memcmp over it, it will not compare the bytes that are at that string pointer, but rather the pointer value itself.

So you must keep track of the data type and wrap your data union into another struct.

dbush · Answer 3 · 2021-09-17T15:27:26.457

The tricky thing about a union is that there's no way to tell which member is the "active" one. If all of the members happen to be the same size you can get away with checking just one member.

If the members are not the same size there are a few pitfalls you could encounter.

If you set a larger member followed by a smaller member, the extra bytes used by the larger member will be unchanged. For example:

union u1 {
    unsigned int a;
    unsigned short b;
};
union u1 x,y;
x.a = 0x12345678;
y.a = 0x87654321;
x.b = 0;
y.b = 0;

Logically x and y have the same value, but x.a == y.a would be false and memcmp(&x, &y, sizeof x) would return nonzero.

Setting just the smaller value could be even worse:

union u1 x,y;
x.b = 0;
y.b = 0;

Here the extra bytes have indeterminate values, so performing x.a == y.a would trigger undefined behavior by attempting to read those values.

You need to keep track of the active member in some way to know which one to read. The simplest way to do this is to wrap the union in a struct with a "tag" field so you know which one to check.

Your isSame function mentioned in the comments would have to take two instances of the containing struct and use a switch statement to choose which field to check. When you call it, you can use a compound literal to create a temp copy of the struct to compare against, i.e.:

isSame((struct myData){ .dataType = ENUM_TYPE_INTEGER, .u = { .i = 5 }}, finalData)
isSame((struct myData){ .dataType = ENUM_TYPE_FLOAT, .u = { .f = 5.34 }}, finalData)
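The switch-based comparison described above could look roughly like the following. This is a sketch under assumptions: the enum values are invented here because the thread names the constants without defining them, the union is assumed to hold the int, float and char * members from the question, and string members are assumed to be non-NULL.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed tag values; the thread never shows their definition. */
enum { ENUM_TYPE_INTEGER, ENUM_TYPE_FLOAT, ENUM_TYPE_STRING };

struct myData {
    int dataType;          /* tag: which member of u is active */
    union {
        int i;
        float f;
        char *str;
    } u;
};

bool isSame(struct myData left, struct myData right)
{
    /* Early out: different active members never compare equal. */
    if (left.dataType != right.dataType)
        return false;

    switch (left.dataType) {
    case ENUM_TYPE_INTEGER:
        return left.u.i == right.u.i;
    case ENUM_TYPE_FLOAT:
        /* Value comparison, so a NaN never equals anything. */
        return left.u.f == right.u.f;
    case ENUM_TYPE_STRING:
        /* Compares contents, not pointers; assumes non-NULL strings. */
        return strcmp(left.u.str, right.u.str) == 0;
    default:
        return false;      /* unknown tag */
    }
}
```

Only the member selected by the tag is ever read, so stale or indeterminate bytes left behind by a wider member cannot affect the result.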

score 0 · Answer 4 · answered Sep 17 '21 at 15:35

Is it enough to compare all members one by one?

Depends on how you compare.
Consider:

union Data {
    float f;
} Data;

if (a.f == b.f) ....

a.f == b.f is true when a.f is +0.0 and b.f is -0.0.
a.f == b.f is false when a.f or b.f is a NaN, even with the same bit pattern.

Better to use memcmp().

It does not seem warranted to double all code where my union is involved simply to be able to make a perfect comparison of the union values.

It is test code. Do not worry about doubling all the code. Make a perfect comparison.

Am I missing some obvious smart way to go about this?

Comparing the widest type with memcmp() should be sufficient.

It is unclear how OP wants to handle the left-over junk in narrower fields. IMO, the memcmp() compare should only happen with the last member assigned as indicated by .dataType.

Luis Colorado · Answer 5 · 2021-09-19T12:02:27.067

You cannot compare two unions if you don't know which selector was used in the last assignment to both variables. The only way to compare both unions (correctly) is that they have been assigned using the same selector field, and have the same value, using the comparison available for the type of that selector.

Let's say, you have:

union data {
    int i;
    float f;
};

and you have two variables A and B, that have been assigned this way:

A.i = 0x80000000; /* the integer value -2147483648 */
B.f = 0.0; /* the float value 0.0 */

With the IEEE-754 binary floating point representation, the two can compare as equal or unequal depending on the selector. If you compare A.f == B.f, they compare equal, because A.i reinterpreted as a float is -0.0, which matches B.f (in floating point, -0.0 == +0.0). But they compare unequal if you compare A.i == B.i (A.i is -2147483648, while B.i is 0). The binary images are indeed different. So you need to know which field selector was used in the last assignment.

Also, imagine that you have:

union data {
    char c;
    char s[100];
};

and imagine you have, as before, variables A and B, that have been assigned:

    strcpy(A.s, "hello, kitty");
    strcpy(B.s, "hello, world");

They will compare as true if they are compared with A.c == B.c, as the first character in both strings is the same, but as false if they are compared with strcmp(A.s, B.s) == 0 (which is the correct way of comparing them in this case).

Moreover, if we later do A.c = 'H'; and B.c = 'H';, then they will compare true if we use A.c == B.c (which is now the correct way to compare), while they will compare false if we use strcmp(A.s, B.s) == 0. In any case, the second selector (the char [100] typed one) can be compared as an array of chars (lexicographically, or with any other kind of collation) or as strings (null delimited), giving different results depending on the history of assignments they have had.

Finally, let's say you have:

union data {
    struct {
        char a1; /* compiler should pad 3 bytes before next field */
        int a2[100]; /* compiler can pad 4 bytes before next field */
        double a3[23];
    } a;
    struct {
        double b1[12];
        char b2; /* compiler could pad 3 bytes space before next field */
        int b3[100];
    } b;
};

How should we compare then? (Think that the padded holes in one selector can be valid data in another.)
