This is similar to the question "What's the actual effect of successful unaligned accesses on x86?".
We have an application which searches through a massive array of structures whose key is a 48-bit integer. We have always used a 64-bit type to represent the 48-bit key, on the assumption that space is cheap and alignment improves performance.
The space used by the array is huge, so I am curious what effect switching to a compact representation would actually have.
This is a performance question, so the essential answer is "measure". I also accept that performance is usually dominated by choosing the best algorithm for your usage pattern.
The question is: what can we hypothesize in advance of measuring?
Effectively changing:
```cpp
struct Entry
{
    uint64_t key;
    int32_t  value1;
    int32_t  value2;
};

std::vector<Entry> foobar;
```
to say:
```cpp
struct uint48
{
    uint16_t v[3];
};

struct Entry
{
    uint48  key;
    int32_t value1;
    int32_t value2;
} __attribute__((packed));

std::vector<Entry> snafu;
```
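Assuming g++, the layout that actually results can be checked cheaply before any real benchmarking. A minimal sketch (the 14-byte size is what I would expect from the packed attribute, not something I have verified):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct uint48
{
    uint16_t v[3];
};

struct Entry
{
    uint48  key;
    int32_t value1;
    int32_t value2;
} __attribute__((packed));

static_assert(sizeof(uint48) == 6, "uint48 should be 6 bytes");

int main()
{
    // Expect 14 with the packed attribute; without it g++ would pad the
    // struct back out (6 + 2 padding + 4 + 4 = 16).
    std::printf("sizeof(Entry) = %zu\n", sizeof(Entry));
    std::printf("value1 offset = %zu\n", offsetof(Entry, value1));
    std::printf("value2 offset = %zu\n", offsetof(Entry, value2));
}
```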
Making this change would save a fair bit of memory in practice. I am curious what effect I could expect it to have on performance.
The only way to know for sure is, of course, to measure, but that would require a not-insignificant amount of effort.
What can I reason about the expected effects?
This is my starting point:
- Searching through the structure consists of a number of operations on the 48-bit key (mostly Hamming distance rather than equality); a sketch of that comparison follows this list.
- The structure remains read-only throughout.
- There are heuristics which mean that successive searches should use local data and hopefully exploit CPU caching.
- Assume an x86_64-based architecture with at least 'standard' extensions such as SSE.
- Using g++ on Linux.
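By "Hamming distance" I mean the usual XOR-plus-popcount comparison. A minimal sketch of how that looks on the current 64-bit representation (my assumption of the inner operation, not our exact code):

```cpp
#include <cstdint>

// Hamming distance between two 48-bit keys held in 64-bit integers.
// Assumes the upper 16 bits are zero. __builtin_popcountll is a GCC
// builtin; with -mpopcnt (or a -march that includes it) it compiles
// to a single POPCNT instruction.
inline int hamming48(uint64_t a, uint64_t b)
{
    return __builtin_popcountll(a ^ b);
}
```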
On the one hand we have:
- An `Entry` is currently 128 bits = 16 bytes wide. Reducing it to 14 bytes removes any 16-byte alignment benefit, if such a thing exists.
- Extracting the 48-bit integer becomes more expensive, i.e. `uint64_t foo = someEntry.key;` (see the extraction sketch after these lists).
On the other hand:
- Reducing the size of each entry increases the chance of the wanted data being in a cache.
- The data set as a whole is far too big to fit completely in any cache.
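To make the extraction cost concrete, this is roughly what reading the key back out of the `uint16_t v[3]` representation would look like; it is a sketch under the assumption that `v[0]` holds the least significant 16 bits, which the definition above does not pin down:

```cpp
#include <cstdint>

struct uint48
{
    uint16_t v[3];
};

// Reassemble the 48-bit key into a 64-bit integer.
// Assumes v[0] holds the least significant 16 bits.
inline uint64_t load48(const uint48& k)
{
    return  static_cast<uint64_t>(k.v[0])
          | static_cast<uint64_t>(k.v[1]) << 16
          | static_cast<uint64_t>(k.v[2]) << 32;
}
```

That is three 2-byte loads plus shifts and ORs, versus the single 8-byte load of the current layout; how much of that the compiler and the CPU hide behind other work is part of what I am trying to estimate.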
The cost of reducing memory use is likely a reduction in performance, but is there a way to estimate it that would justify (or rule out) the actual experiment?
For example, if we expect a 50% loss of performance it's probably not worth even trying. If it's likely to be more like 5%, on the other hand, it may be worth doing.
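For scale, my own back-of-envelope arithmetic (an estimate, not a measurement): the packed layout uses 14/16 = 87.5% of the memory, and a 64-byte cache line holds exactly four aligned 16-byte entries versus 64/14 ≈ 4.6 packed entries on average, roughly 14% more entries per line fetched, with the trade-off that most packed entries straddle a line boundary and the key extraction gets more expensive.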
I hope to perform the experiment in the future regardless, but it is likely to be in the quite distant future due to other commitments.
I think this is an interesting question in terms of how you should approach thinking about such problems.
Optimizing is a rabbit hole, from algorithms down to hand-crafted assembly. There is a lot you can do and many pitfalls on the way. I'm familiar with it from the algorithms side, but much less so when it comes to understanding how a change made higher up affects how things run on the CPU.
Generally we've scaled by adding cores, RAM and nodes with more cores, as these are cheaper than the programmer time and support cost of optimising/recompiling for different commodity hardware. But the exercise is very educational and there are likely many gains to be had from low-hanging fruit.
This question originally used another definition of uint48, suggested [here](https://stackoverflow.com/a/26198075/1569204):
```cpp
struct uint48
{
    unsigned long long v : 48;
} __attribute__((packed));
```
It was pointed out that this may not actually give a 6-byte layout. It is left in here because the answers referencing it would otherwise not make sense.
The answers so far have suggested switching from an AoS layout to an SoA layout. This is indeed a very good point, and probably a better change to make than the one I am discussing on its own. In that sense they answer the performance question, and they have encouraged more thinking and some better questions on my part.
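For concreteness, the SoA layout they suggest would look something like the following; this is my own sketch of the idea rather than code taken from any answer:

```cpp
#include <cstdint>
#include <vector>

// Structure-of-arrays: the keys are contiguous on their own, so a search
// that only inspects keys streams 8 (or 6, if packed) bytes per entry
// instead of dragging value1/value2 through the cache as well.
struct EntriesSoA
{
    std::vector<uint64_t> keys;    // could itself use a 48-bit packing
    std::vector<int32_t>  value1;
    std::vector<int32_t>  value2;
};
```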
What has not been covered so far is the optimal layout for a 48-bit type itself and the actual effects of alignment.
Consider:
```cpp
uint64_t key : 48;   // as a bit-field member of Entry

int64_t x = key;
```
vs.
```cpp
struct uint48
{
    int32_t msb;
    int16_t lsb;
};

uint48 key;

// msb holds the top 32 bits of the 48-bit key, lsb the low 16 bits;
// the unsigned casts avoid sign-extension artifacts.
int64_t x = ((int64_t)(uint32_t)key.msb << 16) | (uint16_t)key.lsb;
```
vs.
```cpp
struct uint48
{
    int8_t v[6];
};

uint48 key;
```
etc.
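One further option I am aware of for the packed `Entry` (my own sketch, not something suggested in the answers): because `key` sits at offset 0 of a 14-byte struct, an 8-byte load from its address never reads outside the object, so the key can be fetched with a single possibly-unaligned load and a mask:

```cpp
#include <cstdint>
#include <cstring>

// Read the low 48 bits starting at p with one 8-byte load and a mask.
// The 2 extra bytes read come from value1, which immediately follows the
// key inside the packed Entry, so the access stays within the object.
// Assumes a little-endian layout (true on x86_64).
inline uint64_t load48_masked(const void* p)
{
    uint64_t x;
    std::memcpy(&x, p, 8);              // g++ typically lowers this to one mov
    return x & 0x0000FFFFFFFFFFFFull;   // keep only the low 48 bits
}
```

My understanding is that on x86_64 an unaligned load that stays within one cache line costs about the same as an aligned one, and the penalty, if any, comes from loads that straddle a line boundary, which is exactly the kind of effect I would like to be able to estimate.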
Also, how important is alignment relative to other considerations?