Custom data size for memory alignment

Question

Each datatype has a certain range, based on the hardware. For example, on a 32bit machine an int has the range -2147483648 to 2147483647.

C++ compilers 'pad' object memory to fit into certain sizes. I'm pretty sure it's 2, 4, 8, 16, 32, 64 etc. This also probably depends on the machine.

I want to manually align my objects to meet padding requirements. Is there a way to:

Determine what machine a program is running on
Determine padding sizes
Set custom data-type based on bitsize

I've used bitsets before in Java, but I'm not familiar with C++. As for machine requirements, I know programs for different hardware sets are usually compiled differently in C++, so I'm wondering if its even possible at all.

Example->

/*getHardwarePack size obviously doesn't exist, just here to explain. What I'm trying to get  
here would be the minimum alignment size for the machine the program is running on*/

#define PACK_SIZE = getHardwarePackSize();
#define MONTHS = 12;

class date{

    private:

           //Pseudo code that represents making a custom type
           customType monthType = MONTH/PACK_SIZE; 

           monthType.remainder  = MONTH % PACK_SIZE;

           monthType months = 12;
};

The idea is to be able to fit every variable into the minimum bit size and track how many bits are left over.

Theoretically, it would be possible to make use of every unused bit and improve memory efficiency. Obviously this would never work anything like this, but the example is just to explain the concept.

So what is your question? All you've done is say what you want to do. You need to specifiy what compiler are you using. What have you tried to achieve your goal and what were the results. etc. — Dale Wilson, Nov 18 '15 at 21:43
Well for one, I want to be able to specify variable to a specific size, so instead of int which is much too large usually, types that are multiples of the minimum size. (Which will be determined by machine type and padding sizes.) — bigcodeszzer, Nov 18 '15 at 21:58
I also want to create my own library to work with, which will allow my programs to minimize padding for program efficiency. — bigcodeszzer, Nov 18 '15 at 21:59
On the object level, that is. So basically, 'auto object memory alignment' — bigcodeszzer, Nov 18 '15 at 22:00
Can you give an example or two of what you expect to be able to do? It's hard to understand your goal just from the description. — John Kugelman, Nov 18 '15 at 22:55

Mats Petersson · Accepted Answer · 2015-11-19T00:18:37.563

3

This is a lot more complex than what you are trying to describe, as there are requirements for alignment on objects and items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar in ARM, PowerPC, etc). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).

Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. And measure the performance before/after, and test thoroughly that nothing has been broken.

The padding is typically just to the next "minimum alignment for the largest type" - e.g. if a struct contains only int and a couple of char variables, it will be padded to 4 bytes [inside the struct and at the end, as required]. For double, padding to 8 bytes is done to ensure, but three double will, typically, take up 8 * 3 bytes with no further padding.

Also, determining what hardware you are executing on (or will execute on) is probably better done during compilation, than during runtime. At runtime, your code will have been generated, and the code is already loaded. You can't really change the offsets and alignments of things at this point.

If you are using the gcc or clang compilers, you can use the __attribute__((aligned(n))), e.g. int x[4] __attribute__((aligned(32))); would create a 16-byte array that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But this is a compile-time option, can not be used at runtime.

It is also possible, in C++11 onwards, to find out the alignment of a type or variable with alignof.

Note that it gives the alignment required for the type, so if you do something daft like:

 int x;
 char buf[4 * sizeof(int)];
 int *p = (int *)buf + 7;
 std::cout << alignof(*p) << std::endl;

the code will print 4, although the alignment of buf+7 is probably 3 (7 modulo 4).

Types can not be chosen at runtime. C++ is a statically typed language: the type of something is determined at runtime - sure, classes that derive from a baseclass can be created at runtime, but for any given object, it has ONE TYPE, always and forever until it is no longer allocated.

It is better to make such choices at compile-time, as it makes the code much more straight forward for the compiler, and will allow better optimisation than if the choices are made at runtime, since you then have to make a runtime decision to use branch A or branch B of some piece of code.

As an example of aligned vs. unaligned access:

#include <cstdio>
#include <cstdlib>
#include <vector>

#define LOOP_COUNT 1000

unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

struct A
{
    long a;
    long b;
    long d;
    char c;
};

struct B 
{
    long a;
    long b;
    long d;
    char c;
} __attribute__((packed));

std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);


int main()
{
    for (int i = 0; i < LOOP_COUNT; i++)
    {
    arr1[i].a = arr2[i].a = rand();
    arr1[i].b = arr2[i].b = rand();
    arr1[i].c = arr2[i].c = rand();
    arr1[i].d = arr2[i].d = rand();
    }

    printf("align A %zd, size %zd\n", alignof(A), sizeof(A));
    printf("align B %zd, size %zd\n", alignof(B), sizeof(B));
    for(int loops = 0; loops < 10; loops++)
    {
    printf("Run %d\n", loops);
    size_t sum = 0;
    size_t sum2 = 0;
    unsigned long long before = rdtscl();
    for (int i = 0; i < LOOP_COUNT; i++)
        sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
    unsigned long long after = rdtscl();
    printf("ARR1 %lld sum=%zd\n",(after - before),  sum);

    before = rdtscl();
    for (int i = 0; i < LOOP_COUNT; i++)
        sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
    after = rdtscl();
    printf("ARR2 %lld sum=%zd\n",(after - before),  sum2);
    }
}

[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me writing code from scratch, that isn't relevant to the project]

Then the results:

$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518

As you can see, the code that is aligned, using arr1 takes around 3900 clock-cycles, and the one using arr2 takes around 4200 cycles. So 300 cycles in roughly 4000 cycles, some 7.5% if my "menthol arithmetic" is works correctly.

Of course, like so many different things, it really depends on the exact situation, how the objects are used, what the cache-size is, exactly what processor it is, how much other code and data in other places around it also using cache-space. The only way to be certain is to experiment with YOUR code.

[I ran the code several times, and although I didn't always get the same results, I always got similar proportional results]

edited Nov 19 '15 at 00:18

answered Nov 18 '15 at 23:31

Mats Petersson

126,704
14
140
227

Except for very special cases, this is probably not worth it. The space saved is typically small, and the amount of extra code more than fills the space saved in data space. Only meaningful in (some) hardware design, where accessing individual bits is done as individual bits anyway, so you don't need extra code to extract the bits you need. – Mats Petersson Nov 18 '15 at 23:36
So are you saying that it's impossible to determine because of unknown compiler behavior (unless of course you fully understand the compiler). – bigcodeszzer Nov 18 '15 at 23:36
Hmmm. I'm not trying to actually /force/ the alignment, just assumed that if you handed the compiler already optimized aligned data, that it would be faster? – bigcodeszzer Nov 18 '15 at 23:37
When you say the extra code, what do you mean specifically? What type of instructions take up RAM? – bigcodeszzer Nov 18 '15 at 23:38
Uh, that's why the compiler pads and aligns things in the first place, because, except for rather special cases, it knows what the "best" alignment is. Of course, this is a compromise - it may, sometimes, be faster to align one `int` to 16 or 32 bytes. But usually just 4 bytes alignment is fine. – Mats Petersson Nov 18 '15 at 23:39
I thought that padding was excess data? Doesn't excess imply that it's not optimal. (Not saying I know, I'm just curious.) – bigcodeszzer Nov 18 '15 at 23:40
Unless you have separate RAM and ROM in your computer (such as very specific small embedded systems), instructions and data are both in RAM. Are you designing something that is going into a (not-smart) watch? Or maybe the controls for a microwave oven, washing machine, etc? If so, reducing padding so it fits in 8KB of RAM rather than 12KB may be useful, and assuming there is space in ROM, the extra instructions will live in ROM. – Mats Petersson Nov 18 '15 at 23:42
I'm thinking more in terms of generating thousands of small objects. If each object needs to be padded a few bits, at what point (hundreds of thousands, millions?) will it be a perceived speed detriment? – bigcodeszzer Nov 18 '15 at 23:44
So, the padding is "excess" data in the sense that you typically can't use it, because it is placed between two items in a data structure (`struct` or `class`). But it's there to improve the performance of the code [and in some cases, it's required for correct operation, for example my old 68000 machine would not read an 32-bit `int` unless it was aligned to 4 bytes - it would issue an "odd address trap"). – Mats Petersson Nov 18 '15 at 23:44
If you have millions of objects that are "badly aligned", the access to those elements of the objects will be slower [in most cases] than if they are "well aligned". And it can make a pretty decent difference. But without understanding EXACTLY what you are doing, it's hard to give precise advice. I'll hack up a little example, to show what I mean. See edit in a little bit. – Mats Petersson Nov 18 '15 at 23:46
Well I'm dealing with things like AI algorithms and collision detection/physics. So generally speaking: the less ram you use, the better. I just assume having each object perfectly aligned would be a good idea. – bigcodeszzer Nov 18 '15 at 23:48
I was thinking like, for example, you could push all the bits that exceed the padding limit, into a universal 'leftovers' object... so as to improve the speed. – bigcodeszzer Nov 18 '15 at 23:48
It is very hard to do that without making the code noticeably slower, since you need extra code to "dig out" those parts in some way. Most likely not worth it. – Mats Petersson Nov 19 '15 at 00:20
@bigcodeszzer: Example code showing the slowing down of data-access when it is unaligned. – Mats Petersson Nov 19 '15 at 00:24
Digging it out may not be worth it but you could possibly have a child object for each polymorphic tree that gathers all the excess bits. Or there's something also to be said about design practice as far as knowing the padding limit. – bigcodeszzer Nov 19 '15 at 00:27
Interesting example but I'd prolly say having that 7.5% difference is significant, especially when you add up a lot of aligned vs unaligned objects across a large program. – bigcodeszzer Nov 19 '15 at 00:29
Yes, the aligned version is 7.5% faster, which is definitely worth having in most programs. That's what I was saying in the first place - you don't want to "mess" with the alignment and padding. – Mats Petersson Nov 19 '15 at 00:30
Right, but what I'm suggesting is that the programs are aligned manually, with every object size being in multiples of the minimum length. For example, in Java, that min length is 8. So the program is designed so all objects fit multiples of 8's. – bigcodeszzer Nov 19 '15 at 00:40
What I'm suggesting is, if the objects can't meet the 8bit requirement, that the excess data is cut off and then stuck into another object so that it /will/ fit in division of 8. This way there is no unused padding data. – bigcodeszzer Nov 19 '15 at 00:42
Are you talking about bitfields in structs? Yes, but then you need extra `and` and `shift` instructions to extract the data. It's NEARLY always worse than having 8-bit (or bigger) data items in the first place. – Mats Petersson Nov 19 '15 at 00:46
Well, in the case of a data type, yeah, it would be harder. But I'm also talking about object byte length. Objects are also padded. So all I'm saying is, you remove the trailing data that exceeds the limit, and put all trailing data in a different object. – bigcodeszzer Nov 19 '15 at 00:56
In the case of data types, I suppose you could use boolean arrays instead of normal types, and compute the binary numbers. That's the only thing I can think of. – bigcodeszzer Nov 19 '15 at 00:57
But any such operation will require bit-shifts and and/or operations to separate out/combine elements. It's significantly slower, unless it is quite rare to use such data. – Mats Petersson Nov 19 '15 at 01:00
Well everything is in binary ultimately. If you track indices, theoretically you could manipulate different arrays mathematically. How does the cpu work with binary in terms of int, double etc. Is it different? – bigcodeszzer Nov 19 '15 at 01:19
Well, if it's nicely aligned and available as a 32-bit "object", the CPU can just load it. It is not "converted to binary", it is always binary (except when reading in text from the console or a file, where it is converted to binary as part of the read function). And of course, the reverse when printing it. – Mats Petersson Nov 19 '15 at 01:26
Which would increase the speed, no? That is what I'm after. I wonder if it could be constructed into a library to make it automatic. – bigcodeszzer Nov 19 '15 at 01:34
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/95502/discussion-between-bigcodeszzer-and-mats-petersson). – bigcodeszzer Nov 19 '15 at 01:34
http://stackoverflow.com/questions/33829894/memory-efficient-ai-objects-for-physics-game – bigcodeszzer Nov 21 '15 at 02:38

Custom data size for memory alignment

1 Answers1

Linked