4

I need to do this millions of times as fast as possible. Say I have two lists of several short char arrays:

"a b ", "a c ", "a x ", etc...
" w z", " w y", " q b"

Now I want to form combinations of one from each list. For example, "a b " and " w z" would become "awbz".

It seems like the most efficient way would be to store them as a 32-bit sequences:

"a b " --> 0x00620061
" w z" --> 0x7A007700

Now OR them together to get

0x7A627761 --> "awbz"

My first thought is to use a union, but I know that this technically presents undefined behavior...writing to part of a union variable followed by reading a different type from the union.

union {
  unsigned char[4] c;
  unsigned int i;
};

My second thought would be to use casts to switch between int and char[]. Is there a way to safely do it this way?

JohnPS
  • 2,518
  • 19
  • 17
  • What about some sort of look-up table (if the two lists are short enough)? – jrok Jan 27 '12 at 21:54
  • Good idea, but the lists will be too big, so there will be too many combinations, especially if I try this with forming 8 byte char lists composed of 4 subparts (which I didn't mention). – JohnPS Jan 27 '12 at 22:01

3 Answers3

2

The good news is that in C11 it's not undefined behaviour to read a union member other than the one last written to. Footnote 95 to 6.5.2.3 says

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.

The bad news is that C11 compilers are still rare. However, most compilers behave as expected, and gcc has long guaranteed that behaviour. I would just use the union unless there are very strong reasons not to.

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
  • Oh, that's good. I know VC++ has always seemed to treat unions this way, but I'm not sure it's guaranteed. I'll look into this. I never heard the term "type punning," so thanks for that. – JohnPS Jan 27 '12 at 22:09
2

In C++, type punning to char* is always allowed. So you're in luck.

Just use int32_t to store the values as you propose. Store the result of the bitwise-OR in a variable, and use reinterpret_cast on its address.

int32_t first = 0x00620061;
int32_t second = 0x7A007700;
int32_t combined = first | second;
std::string s(reinterpret_cast<const char*>(&combined), 4);

Demo: http://ideone.com/1KGBl

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • This is probably the method I'll use. But what do you mean by "type punning to `char*` is always allowed"? Does `reinterpret_cast` have special guarantees over casts to other built in types? – JohnPS Jan 27 '12 at 22:19
  • The language spec is the spec, but I don't understand why `int` can't be treated the same in the rules. Does this have to do with endianness or something else? – JohnPS Jan 27 '12 at 22:58
  • @JohnPS: No, it has to do with optimization. For example, if you have a loop that writes to an `int[]` and reads from a `float&`, the strict aliasing rule allows the compiler to move reading the `float` above the loop. There's a ton of good information in the other answers on that page. – Ben Voigt Jan 27 '12 at 23:02
0

Whether or not it is UB in some C/C++ specs, the reality is using a union is likely to work as expected in just about every compiler, eg:

union Char4Int32
{ 
  unsigned char[4] c; 
  unsigned int32_t i; 
}; 

Char4Int32 first, second, combined;
strncpy(first.c, "a b ", 4); 
strncpy(second.c, " w z", 4); 
combined.i = first.i | second.i;
std::string s(combined.c, 4); 
Remy Lebeau
  • 555,201
  • 31
  • 458
  • 770
  • The space characters will not be zero. In this case, since space is `0x20` and lower case characters in ascii all have that bit set, I think it would work. In general, we'd have to do `"a\0b\0"` and `"\0w\0z"`. – JohnPS Jan 28 '12 at 01:42