Data alignment to enable vectorization / efficient cache access

Question

This book says the following:

For Knights Landing, memory movement is optimal when the data starting address lies on 64-byte boundaries.

Q1. Is there a way to query the processor in C++ code dynamically to know what this optimal n-byte boundary would be for the processor on which the application is currently running? That way, the code would be portable.

The book further states:

As programmers, we end up with two jobs: (1) align our data and (2) make sure the compiler knows it is aligned.

(Suppose for the question below that we know that it is optimal for our processor to have data start at 64-byte boundaries.)

What exactly is this "data" though?

Suppose I have a class thus:

#include <vector>

class Class1_ {
private:
    int a;     // 4 bytes
    double b;  // 8 bytes
    std::vector<int> potentially_longish_vector_int;
    std::vector<double> potentially_longish_vector_double;
    double* potentially_longish_heap_array_double;
public:
    //--stuff---//
    double* return_heap_array_address() { return potentially_longish_heap_array_double; }
};  // a class definition must end with a semicolon

Suppose I also have functions that are prototyped thus:

void func1(Class1_& obj_class1);

void func2(double* array);

That is, func1 takes in an object of Class1_ by reference, and func2 is called as func2(obj_class1.return_heap_array_address());

To be consistent with the advice that data should be appropriately boundary aligned, should obj_class1 itself be 64-byte boundary aligned for efficient functioning of func1()? Should potentially_longish_heap_array_double be 64-byte boundary aligned for efficient functioning of func2()?

For alignment of other data members of the class which are STL containers, the thread here suggests how to go about accomplishing the required alignment.

Q2. So, does the object itself need to be appropriately aligned as well as all of the data members within it?

The usual way of checking the alignment requirement is to run cpuid and look the result up in a table. Very manual. — Matthieu Brucher, Nov 23 '18 at 16:43
For the alignment issue, I think (because it says data alignment) it's mainly for vector operations: you want them to start with no "peel", hence you want arrays to be 64-byte (or bit??) aligned. — Matthieu Brucher, Nov 23 '18 at 16:44
What exactly is `memory movement` referring to? There are trade-offs for alignment. It may not in all situations be preferable. What usage situation is Q2 considering? — , Nov 23 '18 at 16:45
@MaximEgorushkin the advice is in a general chapter devoted to vectorization. No specific algorithm. — Tryer, Nov 23 '18 at 16:46
It is the code that gets vectorized, not the data. You need to show the code that you would like to vectorize. Note that `memcpy` and `memset` are already vectorized. — Maxim Egorushkin, Nov 23 '18 at 16:48
@eukaryota Memory movement is discussed in the context that transfers between caches start at addresses that are multiples of 64 bytes. So, if we refer to a double that starts at address 100, the cache line would potentially be filled with bytes #64 through #127, and byte #100 is accessed via an offset into it. — Tryer, Nov 23 '18 at 16:48
As for Q1: Based on the comment this seems to be talking about the cache line size. In that case related: https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons — , Nov 23 '18 at 16:51
For Q2: The answer will depend on what `func1` and `func2` are going to do. The data alignment for performance has nothing to do with function calls. Also for Q1: does "dynamically" mean at runtime? My previous link talks about a compile-time feature only. — , Nov 23 '18 at 16:57
@eukaryota Yes, of course. Suppose memory size is not a problem, is it not always better to have both the object begin at a 64 byte boundary as well as (in the extreme case) every data member (more specifically stack/heap allocated arrays/STL vectors) begin on the said byte boundary? — Tryer, Nov 23 '18 at 17:03
@Tryer No, that is a bad idea, because then each variable will occupy a whole cache line and only a few of these can be cached at a time, so in effect many more of your memory accesses would cause cache misses. — , Nov 23 '18 at 17:08

Maxim Egorushkin · Accepted Answer · 2018-11-23T17:12:07.400

In general, aligning your arrays on a cache line boundary maximises cache utilisation and also makes the arrays suitably aligned for any SIMD instructions. That is because the unit of transfer between RAM and CPU caches is a cache line, which is 64 bytes on modern Intel CPUs.
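As a minimal sketch of the two jobs quoted in the question (align the data, then tell the compiler), the following allocates a heap array on a 64-byte boundary and passes the alignment hint to a loop. The names `kCacheLine`, `make_aligned_doubles`, `is_aligned` and `scale` are made up for illustration, the 64-byte figure is an assumption taken from the text above, `std::aligned_alloc` needs C++17 (and is not provided by MSVC), and `__builtin_assume_aligned` is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Assumed cache-line size; 64 bytes matches the Intel CPUs discussed above.
constexpr std::size_t kCacheLine = 64;

// Memory obtained from std::aligned_alloc must be released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};
using AlignedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Job 1: allocate n doubles starting on a 64-byte boundary.
// std::aligned_alloc wants a size that is a multiple of the alignment, so round up.
inline AlignedDoubles make_aligned_doubles(std::size_t n) {
    std::size_t bytes = n * sizeof(double);
    bytes = (bytes + kCacheLine - 1) / kCacheLine * kCacheLine;
    if (bytes == 0) bytes = kCacheLine;
    return AlignedDoubles(static_cast<double*>(std::aligned_alloc(kCacheLine, bytes)));
}

// True if p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Job 2: tell the compiler that data is 64-byte aligned.
// Passing an unaligned pointer here would be undefined behaviour, hence the assert.
inline void scale(double* data, std::size_t n, double factor) {
    assert(is_aligned(data, kCacheLine));
#if defined(__GNUC__) || defined(__clang__)
    data = static_cast<double*>(__builtin_assume_aligned(data, kCacheLine));
#endif
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}
```

A caller would write something like `auto buf = make_aligned_doubles(1000); scale(buf.get(), 1000, 2.0);`. Whether the hint changes the generated code depends on the compiler and target, so it is worth measuring before relying on it.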

However, increased alignment may also waste memory and reduce cache utilisation. Normally, only data structures on the critical fast path of your application need an increased alignment.

It makes sense to arrange members of your classes in {hotness, size} order, so that most frequently accessed members or members accessed together reside on the same cache line.
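One way to picture the {hotness, size} ordering is the sketch below. `Particle` and its members are hypothetical, and the 64-byte line size is again an assumption. The members touched on every step sit at the front so that they share the first cache line, the rarely used ones follow, and `static_assert` documents the layout the code relies on.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size, as in the discussion above.
constexpr std::size_t kCacheLine = 64;

// Hypothetical type: alignas makes every Particle start on a cache-line boundary.
struct alignas(kCacheLine) Particle {
    // Hot members, accessed together on every step: 24 + 24 + 4 = 52 bytes.
    double x, y, z;
    double vx, vy, vz;
    std::uint32_t flags;
    // Cold members, accessed rarely.
    std::uint32_t id;
    double creation_time;
    char label[32];
};

static_assert(alignof(Particle) == kCacheLine, "object starts on a cache line");
static_assert(offsetof(Particle, flags) + sizeof(std::uint32_t) <= kCacheLine,
              "all hot members fit in the first cache line");
static_assert(sizeof(Particle) % kCacheLine == 0,
              "array elements keep the alignment");
```

The cost is visible in `sizeof(Particle)`: the struct is padded up to a multiple of 64 bytes, which is the memory trade-off mentioned above. Aligning each individual member to 64 bytes would make that much worse.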

The optimization objective here is to reduce cache and TLB misses (or, equivalently, to decrease cycles-per-instruction / increase instructions-per-cycle). TLB misses can be reduced by using huge pages.
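Regarding Q1, a rough sketch of how the cache-line size could be obtained is given below. The function names are made up for illustration. `std::hardware_destructive_interference_size` is a C++17 compile-time constant that not every standard library provides, and `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)` is a glibc extension on Linux that may report 0 or fail in some environments, so the code falls back to an assumed 64 bytes.

```cpp
#include <cstddef>
#include <new>
#if defined(__linux__)
#include <unistd.h>
#endif

// Compile-time value: the library's constant if available, else an assumed 64.
constexpr std::size_t compile_time_line_size() {
#if defined(__cpp_lib_hardware_interference_size)
    return std::hardware_destructive_interference_size;
#else
    return 64;  // assumption, matches the Intel CPUs discussed above
#endif
}

// Run-time value: ask the OS where possible, else use the compile-time value.
inline std::size_t runtime_line_size() {
#if defined(__linux__) && defined(_SC_LEVEL1_DCACHE_LINESIZE)
    const long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) return static_cast<std::size_t>(n);
#endif
    return compile_time_line_size();
}
```

Note that `alignas` needs a compile-time constant, so a value obtained at run time can only steer dynamic allocations (for example the alignment argument of `std::aligned_alloc`), not the layout of a class.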

edited Nov 23 '18 at 17:12

answered Nov 23 '18 at 16:51

Maxim Egorushkin


Yes. So, is it better to start an object on a 64-byte boundary and also every data member within it? Some functions may want to access the `int a` and `double b`; some may want to access the heap-allocated array. – Tryer Nov 23 '18 at 16:58
@Tryer It depends on how the object and its members are used. – Maxim Egorushkin Nov 23 '18 at 17:04
Yes, good point about the cache utilization if there are too many "empty holes". – Tryer Nov 23 '18 at 17:05
