This is a lot more complex than what you are trying to describe, as there are alignment requirements on objects and on the items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar on ARM, PowerPC, etc). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).
Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. And measure the performance before/after, and test thoroughly that nothing has been broken.
The padding is typically just to the alignment of the member with the strictest requirement (usually the largest basic type) - e.g. if a struct contains only an int and a couple of char variables, it will be padded to 4 bytes [inside the struct and at the end, as required]. For double, padding to 8 bytes is done to ensure alignment, but three double members will, typically, take up 8 * 3 bytes with no further padding.
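To see what your own compiler does with padding, a sketch along these lines can help. The names Mixed, ThreeDoubles and print_layout are made up for this illustration, and the byte counts in the comments are what I would expect on a typical x86-64 target, not a guarantee.

```cpp
#include <cstddef>  // offsetof
#include <cstdio>

// Hypothetical struct: one int and a couple of chars.
struct Mixed {
    char a;  // typically followed by 3 bytes of padding so that n is aligned
    int  n;
    char b;  // typically followed by 3 bytes of tail padding
};

// Hypothetical struct: three doubles, typically 3 * 8 bytes with no padding.
struct ThreeDoubles {
    double x, y, z;
};

// These should hold for any struct that has not been packed,
// whatever the actual sizes turn out to be.
static_assert(sizeof(Mixed) % alignof(Mixed) == 0, "size is a multiple of the alignment");
static_assert(offsetof(Mixed, n) % alignof(int) == 0, "n sits at an aligned offset");

// Print the layout the compiler actually chose.
void print_layout()
{
    std::printf("Mixed: size %zu, align %zu, offset of n %zu, offset of b %zu\n",
                sizeof(Mixed), alignof(Mixed), offsetof(Mixed, n), offsetof(Mixed, b));
    std::printf("ThreeDoubles: size %zu, align %zu\n",
                sizeof(ThreeDoubles), alignof(ThreeDoubles));
}
```

Calling print_layout() from main shows the real numbers for your target, which is more reliable than reasoning about them.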
Also, determining what hardware you are executing on (or will execute on) is probably better done during compilation, than during runtime. At runtime, your code will have been generated, and the code is already loaded. You can't really change the offsets and alignments of things at this point.
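As a rough sketch of doing this at compile time: gcc and clang predefine macros such as __AVX__, __SSE2__ and __ARM_NEON depending on the target options, so an alignment can be picked while compiling. The names kBlockAlign and Block are made up here, and the values chosen for each case are only my assumption of what might be sensible.

```cpp
#include <cstddef>  // std::size_t, std::max_align_t

// Pick an alignment from what the compiler says about the target.
// The specific numbers are illustrative assumptions, not recommendations.
#if defined(__AVX__)
constexpr std::size_t kBlockAlign = 32;   // 256-bit vector registers
#elif defined(__SSE2__) || defined(__ARM_NEON)
constexpr std::size_t kBlockAlign = 16;   // 128-bit vector registers
#else
constexpr std::size_t kBlockAlign = alignof(std::max_align_t);
#endif

// Hypothetical type whose alignment is fixed when the code is compiled.
struct alignas(kBlockAlign) Block {
    float data[8];
};

// The offsets and alignment are baked into the generated code.
static_assert(alignof(Block) == kBlockAlign, "Block uses the selected alignment");
```

Building the same source with different target flags then gives a different Block layout, with no runtime decision involved.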
If you are using the gcc or clang compilers, you can use __attribute__((aligned(n))), e.g. int x[4] __attribute__((aligned(32))); would create a 16-byte array that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But this is a compile-time option; it cannot be used at runtime.
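A minimal sketch of the attribute on a variable and on a struct member, for gcc or clang only. Packet, is_aligned and check_alignment are names invented for this example. (Standard C++11 also has alignas, which does a similar job portably.)

```cpp
#include <cassert>
#include <cstdint>

// A 16-byte array (assuming 4-byte int) placed on a 32-byte boundary.
int x[4] __attribute__((aligned(32)));

// Hypothetical struct with one over-aligned member.
struct Packet {
    char tag;
    float lanes[4] __attribute__((aligned(16)));  // padding is inserted after tag
};

// Hypothetical helper: true if p lies on an n-byte boundary.
inline bool is_aligned(const void* p, std::uintptr_t n)
{
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}

// Checks the addresses the compiler and linker actually produced.
inline void check_alignment()
{
    Packet pkt;
    assert(is_aligned(x, 32));
    assert(is_aligned(pkt.lanes, 16));
}
```

The struct as a whole picks up the alignment of its strictest member, so alignof(Packet) should come out as at least 16.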
It is also possible, in C++11 onwards, to find out the alignment of a type with alignof (gcc and clang also accept a variable or expression, as an extension).
Note that it gives the alignment required for the type, so if you do something daft like:
char buf[4 * sizeof(int)];
int *p = (int *)(buf + 7);
std::cout << alignof(*p) << std::endl;
the code will print 4 (on a typical platform), although the address buf + 7 is probably 3 bytes past a 4-byte boundary (7 modulo 4), and so is not suitably aligned for an int.
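If you want to know how a particular address is actually aligned, you have to look at the address itself. A small sketch, where misalignment and demo_offset are hypothetical helper names and buf is forced to int alignment so the result is predictable:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many bytes p lies past the previous
// `alignment`-byte boundary (0 means p is suitably aligned).
inline std::size_t misalignment(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % alignment);
}

// Offset of buf + 7 from an int boundary, with buf itself int-aligned.
inline std::size_t demo_offset()
{
    alignas(int) char buf[4 * sizeof(int)];
    return misalignment(buf + 7, alignof(int));
}
```

With a 4-byte int, demo_offset() gives 3, while alignof(int) still reports 4: one describes the address, the other the type.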
Types cannot be chosen at runtime. C++ is a statically typed language: the type of something is determined at compile time - sure, objects of classes that derive from a base class can be created at runtime, but any given object has ONE TYPE, always and forever until it is no longer allocated.
It is better to make such choices at compile-time, as it makes the code much more straightforward for the compiler, and will allow better optimisation than if the choices are made at runtime, since at runtime you have to decide, every time, whether to use branch A or branch B of some piece of code.
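One way a compile-time choice can look, as a sketch only: std::conditional picks a type from a constant, so the generated code contains just one variant. Counter, kUseWide, NativeCounter and count_nonzero are names I made up, and using the pointer size as the deciding condition is purely an assumption for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Select one of two types from a compile-time constant.
template <bool Wide>
using Counter = typename std::conditional<Wide, std::uint64_t, std::uint32_t>::type;

// Assumption for illustration: use the wide type on 64-bit targets.
constexpr bool kUseWide = sizeof(void*) >= 8;
using NativeCounter = Counter<kUseWide>;

// Uses whichever type was selected; there is no branch on the type at runtime.
inline NativeCounter count_nonzero(const int* values, std::size_t n)
{
    NativeCounter count = 0;
    for (std::size_t i = 0; i < n; i++)
        if (values[i] != 0)
            count++;
    return count;
}
```

The size and alignment of NativeCounter are known to the compiler everywhere it is used, which is exactly what a runtime choice cannot give you.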
As an example of aligned vs. unaligned access:
#include <cstdio>
#include <cstdlib>
#include <vector>
#define LOOP_COUNT 1000
// Read the x86 time-stamp counter (gcc/clang inline assembly, x86 only).
unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
// Naturally aligned: the compiler pads after c.
struct A
{
    long a;
    long b;
    long d;
    char c;
};
// Packed: no padding, so most elements in an array of B are misaligned.
struct B
{
    long a;
    long b;
    long d;
    char c;
} __attribute__((packed));
std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);
int main()
{
    // Fill both arrays with the same random values.
    for (int i = 0; i < LOOP_COUNT; i++)
    {
        arr1[i].a = arr2[i].a = rand();
        arr1[i].b = arr2[i].b = rand();
        arr1[i].c = arr2[i].c = rand();
        arr1[i].d = arr2[i].d = rand();
    }
    printf("align A %zu, size %zu\n", alignof(A), sizeof(A));
    printf("align B %zu, size %zu\n", alignof(B), sizeof(B));
    for(int loops = 0; loops < 10; loops++)
    {
        printf("Run %d\n", loops);
        size_t sum = 0;
        size_t sum2 = 0;
        // Time the aligned array.
        unsigned long long before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
        unsigned long long after = rdtscl();
        printf("ARR1 %llu sum=%zu\n",(after - before), sum);
        // Time the packed (unaligned) array.
        before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
        after = rdtscl();
        printf("ARR2 %llu sum=%zu\n",(after - before), sum2);
    }
}
[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me writing code from scratch, that isn't relevant to the project]
Then the results:
$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
As you can see, the code that is aligned, using arr1, takes around 3900 clock-cycles, and the one using arr2 takes around 4200 cycles. So 300 cycles in roughly 4000 cycles, some 7.5% if my "menthol arithmetic" works correctly.
Of course, like so many different things, it really depends on the exact situation: how the objects are used, what the cache-size is, exactly what processor it is, and how much other code and data in other places around it is also using cache-space. The only way to be certain is to experiment with YOUR code.
[I ran the code several times, and although I didn't always get the same results, I always got similar proportional results]