
After reading Q&As 1 and 2 below, and after using the technique discussed here for many years on x86 architectures with GCC and MSVC without seeing any problems, I'm now very confused about what is supposed to be the correct, and just as importantly the most efficient, way to serialize and then deserialize binary data using C++.

Given the following "wrong" code:

#include <fstream>

int main()
{
   std::ifstream strm("file.bin", std::ios::binary);

   char buffer[sizeof(int)] = {0};

   strm.read(buffer, sizeof(int));

   int i = 0;

   // Experts seem to think doing the following is bad and
   // could crash entirely when run on ARM processors:
   i = *reinterpret_cast<int*>(buffer);

   return 0;
}

Now, as I understand things, the reinterpret_cast indicates to the compiler that it can treat the memory at buffer as an integer, and is subsequently free to issue integer instructions which require/assume certain alignments for the data in question, with the only overhead being extra reads and shifts when the CPU detects that the address it is trying to access with alignment-oriented instructions is not actually aligned.

That said, the answers provided above seem to indicate that, as far as C++ is concerned, this is all undefined behavior.

Assuming that the alignment of the location in buffer from which the cast will occur is not conforming, is it true that the only solution to this problem is to copy the bytes one by one? Is there perhaps a more efficient technique?

Furthermore, over the years I've seen many situations where a struct made up entirely of PODs (using compiler-specific pragmas to remove padding) is cast to a char*, written to a file or socket, later read back into a buffer, and the buffer cast back to a pointer of the original struct type. Ignoring potential endian and float/double format differences between machines, is this kind of code also considered undefined behavior?

The following is a more complex example:

#include <cstddef>
#include <fstream>

int main()
{
   std::ifstream strm("file.bin", std::ios::binary);

   char storage[1000] = {0};
   char* buffer = storage;

   const std::size_t size = sizeof(int) + sizeof(short) + sizeof(float) + sizeof(double);

   const std::size_t weird_offset = 3;

   buffer += weird_offset;

   strm.read(buffer, size);

   int    i = 0;
   short  s = 0;
   float  f = 0.0f;
   double d = 0.0;

   // Experts seem to think doing the following is bad and
   // could crash entirely when run on ARM processors:
   i = *reinterpret_cast<int*>(buffer);
   buffer += sizeof(int);

   s = *reinterpret_cast<short*>(buffer);
   buffer += sizeof(short);

   f = *reinterpret_cast<float*>(buffer);
   buffer += sizeof(float);

   d = *reinterpret_cast<double*>(buffer);
   buffer += sizeof(double);

   return 0;
}
Sami Kenjat
  • You can solve the alignment problem using, e.g., `std::aligned_storage<sizeof(int), std::alignment_of<int>::value>::type` instead of `char[sizeof(int)]` (or, if you don't have C++11, there may be similar compiler-specific functionality). – abarnert Nov 09 '12 at 05:12
    @abarnert the example above is simple; in general one would serialize many different types sequentially into one buffer and expect to read them all back. Furthermore, the "buffer" may be requested from a pool which may not have had any alignas applied, etc... – Sami Kenjat Nov 09 '12 at 05:18
  • From the tone of your question, it sounds like you're not so much asking "can I do this" (you know you can't) or "why won't it work?" (it's already been explained to you) so much as "why didn't they design the language differently so it _would_ work?" (Or maybe even, "Is there a language close to C++ that was designed so it would work?") – abarnert Nov 09 '12 at 05:30
    @abarnert: I understand the architectural reasons why the standard did not want to say anything specific about type aliasing from arbitrary memory; my question is more about an efficient yet correct, defined-behavior solution - if there is one, especially one that does not require memcpy, which will be very expensive even for 4 or 8 byte copies... in fact more expensive than the extra doubling of reads and shifts left/right that happens to correct for non-aligned reads. – Sami Kenjat Nov 09 '12 at 05:34
  • What makes you think that copying 4 bytes will be slower than reading 4 bytes, reading another 4 bytes, shifting, or'ing, and storing the result? – abarnert Nov 09 '12 at 05:36
  • @abarnert: I've done some tests and it seems the reads and then the shifts et al. are being done concurrently, whereas memcpy had a function call and stack setup overhead as well as the copy itself (if the compiler doesn't treat it as a special function - which GCC does) – Sami Kenjat Nov 09 '12 at 05:43
  • So, the compiler you tested with _does_ go faster with `memcpy`, but you decided it was cheating, so it doesn't count? – abarnert Nov 09 '12 at 05:48
  • @abarnert: well, there's being standards-conformant and running code that is well defined... then there's the issue of actually getting work done. They're unfortunately not always one and the same. – Sami Kenjat Nov 09 '12 at 05:50
  • And nobody's saying you have to always write standards-compliant code. When someone says "your code will crash on an ARM", that means you'd better not do it in code that may have to be compiled for ARM. If you use it in code that will only ever run on, say, x86_64 Windows 7, it's fine. You should _know_ that your code isn't standard-compliant, and that it will crash on ARM, in case you ever need to port it, or write something similar in an iOS app, but you can still use it to write your Windows 7 app. – abarnert Nov 09 '12 at 06:08
  • Meanwhile, if you're writing code that may have to run on ARM, "actually getting work done" presumably means not having a 75% chance of crashing each time you read an `int` from a file or a socket, right? – abarnert Nov 09 '12 at 06:14
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/19326/discussion-between-sami-kenjat-and-abarnert) – Sami Kenjat Nov 09 '12 at 08:00

1 Answer


First, you can correctly, portably, and efficiently solve the alignment problem using, e.g., std::aligned_storage<sizeof(int), std::alignment_of<int>::value>::type instead of char[sizeof(int)] (or, if you don't have C++11, there may be similar compiler-specific functionality).

Even if you're dealing with a complex POD, aligned_storage and alignment_of will give you a buffer that you can memcpy the POD into and out of, construct it into, etc.

In some more complex cases, you need to write more complex code, potentially using compile-time arithmetic and template-based static switches and so on, but as far as I know, nobody came up with a case during the C++11 deliberations that couldn't be handled with the new features.

However, just using reinterpret_cast on a random char-aligned buffer is not enough. Let's look at why:

the reinterpret cast indicates to the compiler that it can treat the memory at buffer as an integer

Yes, but you're also indicating that it can assume that the buffer is aligned properly for an integer. If you're lying about that, it's free to generate broken code.

and subsequently is free to issue integer compatible instructions which require/assume certain alignments for the data in question

Yes, it's free to issue instructions that either require those alignments, or that assume they're already taken care of.

with the only overhead being extra reads and shifts when the CPU detects that the address it is trying to access with alignment-oriented instructions is not actually aligned.

Yes, it may issue instructions with the extra reads and shifts. But it may also issue instructions that don't do them, because you've told it that it doesn't have to. So, it could issue a "read aligned word" instruction which raises an interrupt when used on non-aligned addresses.

Some processors don't have a "read aligned word" instruction, and just "read word" faster with alignment than without. Others can be configured to suppress the trap and instead fall back to a slower "read word". But others—like ARM—will just fail.

Assuming that the alignment of the location in buffer from which the cast will occur is not conforming, is it true that the only solution to this problem is to copy the bytes one by one? Is there perhaps a more efficient technique?

You don't need to copy the bytes one by one. You could, for example, memcpy each variable one by one into properly-aligned storage. (That would only be copying bytes one by one if all of your variables were one byte long, in which case you wouldn't be worried about alignment in the first place…)

As for casting a POD to char* and back using compiler-specific pragmas… well, any code that relies on compiler-specific pragmas for correctness (rather than for, say, efficiency) is obviously not correct, portable C++. Sometimes "correct with g++ 3.4 or later on any 64-bit little-endian platform with IEEE 64-bit doubles" is good enough for your use cases, but that's not the same thing as actually being valid C++. And you certainly can't expect it to work with, say, Sun cc on a 32-bit big-endian platform with 80-bit doubles and then complain that it doesn't.

For the example you added later:

// Experts seem to think doing the following is bad and
// could crash entirely when run on ARM processors:
buffer += weird_offset;

i = *reinterpret_cast<int*>(buffer);
buffer += sizeof(int);

Experts are right. Here's a simple example of the same thing:

int i[2];
char *c = reinterpret_cast<char *>(i) + 1;
int *j = reinterpret_cast<int *>(c);
int k = *j;

The variable i will be aligned at some address divisible by 4, say, 0x01000000. So, j will be at 0x01000001. So the line int k = *j will issue an instruction to read a 4-byte-aligned 4-byte value from 0x01000001. On, say, PPC64, that will just take about 8x as long as int k = *i, but on, say, ARM, it will crash.

So, if you have this:

int    i = 0;
short  s = 0;
float  f = 0.0f;
double d = 0.0;

And you want to write it to a stream, how do you do it?

writeToStream(&i);
writeToStream(&s);
writeToStream(&f);
writeToStream(&d);

How do you read back from a stream?

readFromStream(&i);
readFromStream(&s);
readFromStream(&f);
readFromStream(&d);

Presumably whatever kind of stream you're using (whether ifstream, FILE*, whatever) has a buffer in it, so readFromStream(&f) is going to check whether there are sizeof(float) bytes available, read the next buffer if not, then copy the first sizeof(float) bytes from the buffer to the address of f. (In fact, it may even be smarter—it's allowed to, e.g., check whether you're just near the end of the buffer, and if so issue an asynchronous read-ahead, if the library implementer thought that would be a good idea.) The standard doesn't say how it has to do the copy. Standard libraries don't have to run anywhere but on the implementation they're part of, so your platform's ifstream could use memcpy, or *(float*), or a compiler intrinsic, or inline assembly—and it will probably use whatever's fastest on your platform.

So, how exactly would unaligned access help you optimize this or simplify it?

In nearly every case, picking the right kind of stream, and using its read and write methods, is the most efficient way of reading and writing. And, if you've picked a stream out of the standard library, it's guaranteed to be correct, too. So, you've got the best of both worlds.

If there's something peculiar about your application that makes something different more efficient—or if you're the guy writing the standard library—then of course you should go ahead and do that. As long as you (and any potential users of your code) are aware of where you're violating the standard and why (and you actually are optimizing things, rather than just doing something because it "seems like it should be faster"), this is perfectly reasonable.

You seem to think that it would help to be able to put them into some kind of "packed struct" and just write that, but the C++ standard does not have any such thing as a "packed struct". Some implementations have non-standard features that you can use for that. For example, both MSVC and gcc will let you pack the above into 18 bytes on i386, and you can take that packed struct and memcpy it, reinterpret_cast it to char * to send over the network, whatever. But it won't be compatible with the exact same code compiled by a different compiler that doesn't understand your compiler's special pragmas. It won't even be compatible with a related compiler, like gcc for ARM, which will pack the same thing into 20 bytes. When you use non-portable extensions to the standard, the result is not portable.

abarnert
    abarnert: "You seem to think that it would help to be able to put them into some kind of 'packed struct'" - No, that's not what I said; all I said was that it was common practice in code bases I've seen to use such a trick. – Sami Kenjat Nov 09 '12 at 05:46
  • @abarnert: wrt the multiple writes, I'm OK with that, but rather than call ifstream.write multiple times, it may be more efficient to write all the variables you want to a buffer, then write the buffer using one call to .write - btw the issue of writing from a type to a buffer is not the problem; it's going back that is problematic. – Sami Kenjat Nov 09 '12 at 05:48
  • @SamiKenjat: You do realize that `ifstream` is buffered, and the buffering is done by the people who wrote the standard library for your platform and will probably do a better job than you unless there are some unusual application-specific issues, right? – abarnert Nov 09 '12 at 05:50
    @abarnert: ofstream, right? This is a Q&A site, so examples are kept simple - toy examples. In real life one may typically serialize MBs or even GBs of data; this 64KB ofstream buffer is useless, and when a write larger than the buffer kicks in, the internal ofstream buffer is completely ignored. As for going the other way round, please also take into account the many other means by which data can enter a program (sockets, usb, etc.); none of these input types buffer natively. – Sami Kenjat Nov 09 '12 at 05:54
  • I said `ifstream` because you said "it's going back that is problematic", and that's `ifstream`. But the same is true in both directions. Besides, why is a 64KB buffer useless? When you serialize GBs of data, you're still writing 8K disk blocks or 4K network sends. At any rate, if you don't like the buffer, you can replace it with a larger one, or build your own buffered stream from scratch. You can write platform-specific code for platforms you think you can optimize better than the implementer, and portable code for everything else. So, I still fail to see what your problem is. – abarnert Nov 09 '12 at 06:01
  • And as for other types not being buffered… you think socket I/O isn't buffered? Are you calling `recv()` and `send()` 1 byte at a time? At any rate, the whole point of `FILE*`, `fstream`, etc. is that they wrap a buffer around something else, like a POSIX fd or a Windows HANDLE. – abarnert Nov 09 '12 at 06:03
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/19325/discussion-between-sami-kenjat-and-abarnert) – Sami Kenjat Nov 09 '12 at 08:00