I have always carefully considered the alignment of data structures. It hurts to let the CPU shuffle bits before processing can start. Gut feelings aside, I measured the cost of unaligned data: write 64-bit long longs into 1 GB of memory, then read them back and check their values.
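// C++ code
#include <cstdlib>   // malloc, free

const long long MB = 1024 * 1024;
const long long GB = 1024 * MB;

void bench(int offset) // pass 0..7 for different alignments
{
    const long long n = (1 * GB - 1024) / 8;   // number of 64-bit values
    char* mem = (char*) malloc(1 * GB);
    if (!mem) throw "out of memory";
    // benchmarked block
    {
        // Deliberately misaligned for offset != 0. Strictly speaking this is
        // undefined behaviour in C++; in practice it usually works on x86,
        // where ordinary loads and stores tolerate misalignment.
        long long* p = (long long*) (mem + offset);
        for (long long i = 0; i < n; i++)
        {
            *p++ = i;                              // write pass
        }
        p = (long long*) (mem + offset);
        for (long long i = 0; i < n; i++)
        {
            if (*p++ != i) throw "wrong value";    // read and verify pass
        }
    }
    free(mem);
}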
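Dereferencing a misaligned `long long*` is undefined behaviour in standard C++, and an optimizer is allowed to assume alignment. A minimal sketch of the same write-then-verify pattern that stays well-defined, assuming `std::memcpy` for the unaligned accesses (compilers typically lower an 8-byte `memcpy` to a single move on x86-64) and `std::chrono::steady_clock` for timing. `bench_memcpy` and its `bytes` parameter are hypothetical names, not part of the original code:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical variant of bench(): writes n 64-bit values starting at
// mem + offset, reads them back and verifies them. All accesses go through
// std::memcpy, which is well-defined for any alignment.
// Returns the elapsed wall-clock time in milliseconds.
double bench_memcpy(std::size_t bytes, int offset)
{
    assert(bytes >= 16 && offset >= 0 && offset < 8);
    const std::size_t n = (bytes - 8) / sizeof(std::int64_t); // room for offset
    // operator new returns memory suitably aligned for any fundamental type,
    // so offset 0 is 8-byte aligned. The vector also zero-fills the buffer,
    // which touches the pages before the timed region starts.
    std::vector<unsigned char> mem(bytes);
    unsigned char* base = mem.data() + offset;

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; i++) {
        std::int64_t v = static_cast<std::int64_t>(i);
        std::memcpy(base + i * sizeof v, &v, sizeof v);   // possibly unaligned store
    }
    for (std::size_t i = 0; i < n; i++) {
        std::int64_t v;
        std::memcpy(&v, base + i * sizeof v, sizeof v);   // possibly unaligned load
        if (v != static_cast<std::int64_t>(i)) throw "wrong value";
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For a run comparable to the original, call `bench_memcpy(1LL << 30, k)` for k = 0..7. The verification loop is meant to keep the compiler from discarding the stores; timings depend on compiler, flags and hardware, so none are claimed here.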
The result surprised me:
offset  1st run  2nd run  2nd run relative to offset 0
0       221      217      100 %
1       228      227      105 %
2       260      228      105 %
3       241      228      105 %
4       219      215       99 %
5       233      228      105 %
6       227      229      106 %
7       228      228      105 %
The cost is only about 5%. If values were stored at random memory locations, the average cost would be 3.75%, since 25% of them would land on an offset without penalty (offsets 0 and 4 in the table). On the other hand, storing data unaligned has the benefit of being somewhat more compact, which could even make up for that 3.75%.
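The 3.75% figure is a plain average over the eight possible byte offsets within an 8-byte word. A small sketch of that arithmetic, assuming every offset is equally likely and using idealized penalties of 0% for offsets 0 and 4 and 5% for the other six; `expected_overhead` is a hypothetical helper:

```cpp
#include <cassert>
#include <vector>

// Expected slowdown (in percent) when a value lands at a uniformly random
// byte offset. penalty[k] is the measured slowdown for offset k, e.g. read
// off the relative column of the table.
double expected_overhead(const std::vector<double>& penalty)
{
    assert(!penalty.empty());
    double sum = 0.0;
    for (double p : penalty) sum += p;
    return sum / penalty.size();
}
```

`expected_overhead({0, 5, 5, 5, 0, 5, 5, 5})` returns 3.75, which is the same as 6/8 × 5%.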
Tests were run on an Intel Core i7-3770 CPU. I tried many variations of this benchmark (e.g. using pointers instead of longs, or random read access to change cache effects), all leading to similar results.
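The random-read variation mentioned above can be sketched as follows. This is an assumption about how such a test might look, not the exact code that was run: the buffer is filled sequentially, then read back in a shuffled order so that, for buffers much larger than the cache, most loads miss and hardware prefetching helps less. `bench_random_read` and its parameters are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical random-read variant: n 64-bit values stored at mem + offset,
// read back in a shuffled order. Returns the read time in milliseconds.
double bench_random_read(std::size_t n, int offset, unsigned seed = 42)
{
    assert(offset >= 0 && offset < 8);
    std::vector<unsigned char> mem(n * sizeof(std::int64_t) + 8);
    unsigned char* base = mem.data() + offset;
    for (std::size_t i = 0; i < n; i++) {           // sequential fill
        std::int64_t v = static_cast<std::int64_t>(i);
        std::memcpy(base + i * sizeof v, &v, sizeof v);
    }

    // Visit every element exactly once, in a pseudo-random order.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937(seed));

    auto start = std::chrono::steady_clock::now();
    for (std::size_t idx : order) {
        std::int64_t v;
        std::memcpy(&v, base + idx * sizeof v, sizeof v);  // possibly unaligned load
        if (v != static_cast<std::int64_t>(idx)) throw "wrong value";
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Generating the shuffled index list outside the timed region keeps the random number generator out of the measurement, at the cost of an extra 8 bytes of index traffic per load.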
Question: Is data structure alignment still as important as we all thought?
I know there are atomicity concerns when 64-bit values span cache lines, but that is not a strong argument for alignment either, because larger structs (say 30 or 200 bytes) will often span cache lines anyway.
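Whether a given access straddles two cache lines is simple address arithmetic. A minimal sketch, assuming a 64-byte cache line (typical for current x86 CPUs); `splits_cache_line` is a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if an access of `size` bytes starting at address `addr`
// touches two cache lines. `line` must be a power of two; 64 is assumed
// by default.
bool splits_cache_line(std::uintptr_t addr, std::size_t size,
                       std::size_t line = 64)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    assert(size != 0 && size <= line);
    return (addr % line) + size > line;
}
```

An 8-byte value at line offset 56 fits exactly, while one at offset 60 crosses into the next line. For a contiguous array of 8-byte values whose base is not 8-byte aligned, exactly one element in eight straddles a 64-byte boundary, so the atomicity concern affects 12.5% of the elements.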
I have always strongly believed in the speed argument, as nicely laid out for instance in "Purpose of memory alignment", and I do not feel comfortable breaking the old rule. But can we actually measure the claimed performance gains of proper alignment?
A good answer could provide a reasonable benchmark showing a speedup factor greater than 1.25 for aligned versus unaligned data, or demonstrate that other commonly used modern CPUs are much more affected by misalignment.
Thank you for your thoughts and measurements.
Edit: I am concerned with classic data structures, where structs are held in memory, as opposed to special cases such as scientific number crunching.
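For structs held in memory, the compactness argument comes from the padding the compiler inserts to keep members aligned. A minimal sketch contrasting a natural layout with a packed one; `#pragma pack` is a compiler extension (supported by GCC, Clang and MSVC), not standard C++, and the struct names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Natural layout: the compiler inserts padding after `tag` so that `value`
// is suitably aligned. On x86-64 with GCC or Clang, sizeof(Natural) is 16.
struct Natural {
    char tag;
    std::int64_t value;
};

// Packed layout: no padding, so `value` sits at offset 1 and is misaligned.
// sizeof(Packed) is 9. Access members directly (p.value); taking &p.value
// yields a plain pointer that the compiler may assume is aligned.
#pragma pack(push, 1)
struct Packed {
    char tag;
    std::int64_t value;
};
#pragma pack(pop)

static_assert(sizeof(Packed) == 1 + sizeof(std::int64_t), "no padding in Packed");
```

In an array of such records the packed form saves 7 bytes per element on x86-64, which is the kind of saving that would have to offset the per-access cost measured above.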
Update: insights from the comments:
Misaligned memory operands handled efficiently on Sandy Bridge
On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks so that the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.
Unaligned access might be faster(!) on Sandy Bridge due to cache organisation.