
I have always carefully considered the alignment of my data structures. It pains me to let the CPU shuffle bytes around before processing can be done. Gut feelings aside, I measured the cost of unaligned data: write 64-bit longs into about a GB of memory, then read their values back, checking correctness.

// C++ benchmark code
#include &lt;chrono&gt;
#include &lt;cstdio&gt;
#include &lt;cstdlib&gt;

const long long MB = 1024 * 1024;
const long long GB = 1024 * MB;

void bench(int offset) // pass 0..7 for different alignments
{
    long long n = (1 * GB - 1024) / 8;
    char* mem = (char*) malloc(1 * GB);
    auto start = std::chrono::steady_clock::now();
    // benchmarked block: write n 64-bit values at the given byte
    // offset, then read them back and verify
    {
        long long* p = (long long*) (mem + offset);
        for (long long i = 0; i < n; i++)
        {
            *p++ = i;
        }
        p = (long long*) (mem + offset);
        for (long long i = 0; i < n; i++)
        {
            if (*p++ != i) throw "wrong value";
        }
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("offset = %d   %lld ms\n", offset, (long long) ms);
    free(mem);
}

The result surprised me:

offset   1st run   2nd run      %
0          221       217      100 %
1          228       227      105 %
2          260       228      105 %
3          241       228      105 %
4          219       215       99 %
5          233       228      105 %
6          227       229      106 %
7          228       228      105 %

The cost is just 5% (if we stored values at random memory locations, the average cost would be 3.75%, since 25% of locations would land aligned). But storing data unaligned has the benefit of being a bit more compact, so the 3.75% average cost could even be compensated.

Tests were run on an Intel 3770 CPU. I tried many variations of this benchmark (e.g. using pointers instead of longs; random read access to change cache effects), all leading to similar results.

Question: Is data structure alignment still as important as we have all thought?

I know there are atomicity concerns when a 64-bit value spreads over two cache lines, but that is not a strong argument for alignment either, because larger data structs (say, 30 or 200 bytes) will often span cache lines anyway.

I have always believed strongly in the speed argument, as laid out nicely here, for instance: Purpose of memory alignment, and I do not feel comfortable disobeying the old rule. But: can we actually measure the claimed performance boost of proper alignment?

A good answer could provide a reasonable benchmark showing a speedup factor of > 1.25 for aligned vs. unaligned data, or demonstrate that other commonly used modern CPUs are much more affected by unaligned access.

Thank you for your thoughts and measurements.

edit: I am concerned with classical data structures, where structs are held in memory, as opposed to special cases like scientific number-crunching scenarios.

update: insights from comments:

  1. From http://www.agner.org/optimize/blog/read.php?i=142&v=t

Misaligned memory operands handled efficiently on Sandy Bridge

On the Sandy Bridge, there is no performance penalty for reading or writing misaligned memory operands, except for the fact that it uses more cache banks, so the risk of cache conflicts is higher when the operand is misaligned. Store-to-load forwarding also works with misaligned operands in most cases.

  2. From http://danluu.com/3c-conflict/

Unaligned access might be faster(!) on Sandy Bridge due to cache organisation.

citykid
  • The Intel Sandy Bridge architecture almost completely removes the penalty for misaligned memory operands, which I expect is what you are seeing. However, some SSE instructions still require alignment. – Matthew Watson Oct 24 '16 at 10:58
  • @MatthewWatson very enlightening, thank you so much! will research Sandy Bridge architecture. – citykid Oct 24 '16 at 11:06
  • This may be of interest: http://www.agner.org/optimize/blog/read.php?i=142&v=t – Matthew Watson Oct 24 '16 at 11:24
  • If you measured this on Core2 you would probably get a very different result – harold Oct 24 '16 at 11:33
  • Results for `double` values would be very interesting also, especially if the test involved calculations complex enough to not fit in the available registers. – Andrew Henle Oct 24 '16 at 13:26

1 Answer


Yes, data alignment is an important prerequisite for vectorisation on architectures that only support SSE, which has strict data alignment requirements, and on newer architectures such as the Xeon Phi. Intel AVX does support unaligned access, but aligning data is still considered good practice to avoid unnecessary performance hits:

Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data

On these architectures, code where vectorisation is useful (e.g. scientific computing applications with heavy use of floating point) may benefit from meeting the respective alignment prerequisites; the speedup would be proportional to the number of vector lanes in the FPU (4x, 8x, 16x). You can measure the benefits of vectorisation yourself by comparing software such as Eigen or PETSc, or any other scientific software, with and without vectorisation (-xHost for icc, -march=native for gcc); you should easily get a 2x speedup.

paul-g
  • Thank you - interesting input. Usual random access data structures should not be affected by vectorisation as I understand it. – citykid Oct 24 '16 at 10:43
  • Unaligned access isn't a prerequisite for vectorization though – harold Oct 24 '16 at 11:26
  • @harold: you mean "aligned" or "unaligned" ? – citykid Oct 24 '16 at 11:50
  • Aligned, but whatever. Telling the compiler that data is aligned makes it generate shorter code, otherwise there will be some more cruft. But it can still be vectorized. – harold Oct 24 '16 at 12:24
  • @harold thanks, you are right - AVX supports unaligned access, though the practice recommended by Intel is to align the data. SSE doesn't support unaligned operands however. I clarified these points in my answer. – paul-g Oct 24 '16 at 13:17