
Similar to this question: What's the actual effect of successful unaligned accesses on x86?

We have an application which searches through a massive array of structures for which the key is a 48-bit integer. We have always used a 64-bit type to represent the 48-bit integer, on the assumption that space is cheap and alignment improves performance.

The space used by the array is huge, so I am curious as to what effect switching to a compact representation would actually have.

This is a performance question, so the essential answer is: measure. I also accept that performance is usually most affected by choosing the best algorithm for the usage pattern.

The question is what can we hypothesize in advance of measuring?

Effectively changing:

struct Entry
{
   uint64_t key;
   int32_t value1;
   int32_t value2;   
};
std::vector<Entry> foobar;

to say:

struct uint48
{
   uint16_t v[3];
};

struct Entry
{
   uint48 key;
   int32_t value1;
   int32_t value2;   
} __attribute__((packed));
std::vector<Entry> snafu;
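
As a sanity check, the size difference can be confirmed directly (a minimal sketch; Entry64 and Entry48 are hypothetical names, and the 16 and 14 figures assume g++ on x86_64):

#include <cstdint>
#include <cstdio>

struct uint48
{
   uint16_t v[3];
};

struct Entry64 { uint64_t key; int32_t value1; int32_t value2; };
struct Entry48 { uint48 key; int32_t value1; int32_t value2; } __attribute__((packed));

int main()
{
   // Expected to print 16 and 14 with g++ on x86_64.
   std::printf("%zu %zu\n", sizeof(Entry64), sizeof(Entry48));
}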

Making this change in practice would save a fair bit of memory. I am curious what effect I could expect it to have on performance in practice.

The only way to know for sure is of course to measure, but this would require a significant amount of effort.

What can I reason about the expected effects?


This is my starting point:

  • Searching through the structure consists of a number of operations on the 48-bit key (mostly Hamming distance rather than equality).
  • The structure remains read-only throughout.
  • There are heuristics which mean that successive searches should use local data and hopefully exploit CPU caching.
  • Assume an x86_64-based architecture with at least 'standard' extensions like SSE.
  • Using g++ on Linux

On the one hand we have:

  • An Entry is currently 128 bits = 16 bytes wide.

    Reducing it to 14 bytes will remove any 16-byte alignment benefit, if such a thing exists.

  • Extracting a 48-bit integer becomes more expensive.

    i.e. uint64_t foo = someEntry.key; no longer works and the key must be reassembled from parts, as in the sketch below.
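
For the uint16_t v[3] layout the extraction might look like this (a minimal sketch, assuming v[0] holds the least significant 16 bits):

#include <cstdint>

struct uint48 { uint16_t v[3]; };   // as defined above

uint64_t load48(const uint48& k)
{
   // three 16-bit loads plus shifts/ors instead of one 64-bit load
   return (uint64_t)k.v[0]
        | ((uint64_t)k.v[1] << 16)
        | ((uint64_t)k.v[2] << 32);
}

Alternatively, a 6-byte memcpy into a zero-initialised uint64_t may compile down to a 4-byte plus a 2-byte load.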

On the other hand:

  • Reducing the size of each entry increases the chance of wanted data being in a cache.

    The data set overall is way too big to completely fit in any cache.

The cost of reducing the memory use is likely a reduction in performance, but is there a way to estimate it that would justify performing the actual experiment?

For example, if we expect a 50% loss of performance, it's probably not worth even trying. If it's likely to be more like 5%, on the other hand, it may be worth doing.

I hope to perform the experiment in the future regardless but it is likely to be quite a distant future due to other commitments.

I think this is an interesting question in terms of how you should approach thinking about such problems.

Optimizing is a rabbit hole from algorithms down to hand-crafted assembly. There is a lot you can do and many pitfalls on the way. I'm familiar with it from the algorithms side, but much less so when it comes to understanding how a change made higher up affects how things run on the CPU.

Generally we've scaled by adding cores and RAM, and nodes with more cores, as these are cheaper than the programmer time and support cost of optimising/recompiling for different commodity hardware. But the exercise is very educational, and there are likely many gains to be had from low-hanging fruit.


This question originally used another definition of uint48, suggested here: https://stackoverflow.com/a/26198075/1569204 :

struct uint48
{
  unsigned long long v:48;
} __attribute__((packed));

It was pointed out that this may not actually give a 6-byte layout. It is left in as otherwise the answers referencing it would not make sense.


The answers so far have suggested switching from an AoS layout to an SoA layout. This is indeed a very good point, and likely a better change to make than the one I am discussing alone. In that sense they answer the performance question. They have also encouraged more thinking and led me to ask myself some better questions.

What has not been included so far is discussion about optimal layouts for 48-bit types and the actual effects of alignment.

Consider:

struct uint48
{
   uint64_t v:48;
};
uint48 key;
int64_t x = key.v;

vs.

struct uint48
{
   uint32_t msb;
   uint16_t lsb;
};
uint48 key;
int64_t x = ((int64_t)key.msb << 16) | key.lsb;

vs.

struct uint48
{
   uint8_t v[6];
};
uint48 key;

etc.
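
Before benchmarking, it is cheap to check what the compiler actually does with each candidate, e.g. sizes and alignments (a minimal sketch; the struct names are made up, and the exact numbers depend on compiler and ABI):

#include <cstdint>
#include <cstdio>

struct uint48_bitfield { uint64_t v:48; };
struct uint48_split    { uint32_t msb; uint16_t lsb; };
struct uint48_bytes    { uint8_t v[6]; };

int main()
{
   // With g++ on x86_64 this is expected to print 8/8, 8/4 and 6/1.
   std::printf("bitfield: size %zu align %zu\n", sizeof(uint48_bitfield), alignof(uint48_bitfield));
   std::printf("split:    size %zu align %zu\n", sizeof(uint48_split),    alignof(uint48_split));
   std::printf("bytes:    size %zu align %zu\n", sizeof(uint48_bytes),    alignof(uint48_bytes));
}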

Also how important is alignment relative to other considerations?

Bruce Adams
  • If you are using **clang**, you might be able to leverage [`_ExtInt(48)`](https://blog.llvm.org/2020/04/the-new-clang-extint-feature-provides.html) extension for your use case. – Eljay Apr 29 '22 at 20:25
  • 3
    Can you split the `Entry` struct into two parallel vectors so that keys and values are separated (similarly to "columnar data formats", https://en.wikipedia.org/wiki/Column-oriented_DBMS)? If you search through all entries and only access a few of the values, it should be faster regardless of the alignment, because caches usually work blockwise -- all of `key`, `value1` and `value2` are always loaded when accessed, which is a huge waste if only `key` is accessed. – krlmlr Apr 29 '22 at 20:25
  • @Eljay using gcc here - https://stackoverflow.com/questions/61412470/will-clangs-custom-int-size-feature-ever-get-added-to-gcc – Bruce Adams Apr 30 '22 at 00:21
  • @krlmlr good point – Bruce Adams Apr 30 '22 at 00:23

3 Answers


Your understanding of the C and C++ 'bitfield' feature is terribly lacking

Your struct uint48 is saying "Allocate me an unsigned long long. Now place v inside it using 48 bits, leaving 16 bits of padding."

A group of bitfield members is always padded out to the size of its foundational type: sizeof(uint48) >= sizeof(unsigned long long).

You can only have memory savings when you have multiple non-static bitfields inside the same structure with the same foundational type, they are declared adjacent to each other, and more than one adjacent field fits within the foundational type for the bitfields.

Your change had no effect on memory consumption.
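
The padding is easy to demonstrate (a minimal sketch; with g++ on x86_64 both structs below report 8 bytes, because w occupies what was padding in the first):

#include <cstdio>

struct padded { unsigned long long v:48; };        // 48 bits plus 16 bits of padding
struct shared { unsigned long long v:48, w:16; };  // w fills the former padding

int main()
{
   std::printf("%zu %zu\n", sizeof(padded), sizeof(shared));   // expected: 8 8
}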

To make search performance better, use a better search algorithm

You mention that you are searching for inexact matches based on Hamming distance. You need to sort your structure and implement lookup that takes advantage of that sorting to perform pruning.

If you are looking for matches with a Hamming distance of 4 or fewer, and there are 6 bits different in the first 16, you should skip over the entire block because no matter what values appear in the next 32 bits, you'll never get a total distance of 4. So you can prune the entire subspace and not search through it.
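
As an illustration of the pruning idea (a hypothetical sketch using GCC's __builtin_popcountll; the function names are made up):

#include <cstdint>

// Hamming distance between two 48-bit keys held in uint64_t.
int hamming48(uint64_t a, uint64_t b)
{
   return __builtin_popcountll((a ^ b) & 0xFFFFFFFFFFFFull);
}

// Prune: if the top 16 bits alone already differ by more than
// maxDist, no key sharing that prefix can be within maxDist.
bool canSkipPrefix(uint64_t query, uint64_t prefix, int maxDist)
{
   int prefixDist = __builtin_popcountll(((query ^ prefix) >> 32) & 0xFFFF);
   return prefixDist > maxDist;
}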

A radix tree is likely going to be a suitable structure (maybe not optimal but a huge improvement over what you have) because the prefix distance provides a lower bound for the distance deeper in that branch of the tree allowing you to skip (prune) large sections of the tree.

This sort of data structure subsumes @krlmlr's excellent point about moving the associated values away from the key list, because in a radix tree the key is actually implicit in the tree structure, and you never reach the last level where the leaf values are stored (bringing them into cache) until you have already processed the key and verified that your criteria are met.

If you don't know the allowed weight of mismatch, you should still use some branch-and-bound techniques.

Ben Voigt
  • In fairness to me on the first point, I nicked the definition of the 48-bit integer from this answer - https://stackoverflow.com/a/26198075/1569204 . I did not examine it in any detail. I would have probably used a fixed-length array of 6 uint8_t or some other tomfoolery. – Bruce Adams Apr 29 '22 at 23:58
  • "To make search performance better, use a better search algorithm". Yes of course but that is missing the point of the question which is how can to reason about the effect of changes, like the indicated one. (In fact the algorithm used is quite efficient. I deliberately glossed over the details.) – Bruce Adams Apr 30 '22 at 00:16
  • @BruceAdams • If you want to visualize your bitfield layout, I have an example here: https://stackoverflow.com/a/69227655/4641116 – Eljay Apr 30 '22 at 00:26
struct uint48
{
  unsigned long long v:48;
} __attribute__((packed));

Do not do that. It tells the compiler the object might be just byte-aligned, so it generally has to use unaligned loads to access it. You would likely be better off with three uint16_t, which the compiler can at least assume are two-byte aligned. Or maybe by telling the compiler it is two-byte aligned instead of packed. (In some situations, depending on cache patterns and other factors, it might even be beneficial to keep one array of uint32_t and another array of uint16_t, and then the compiler can load each with one instruction appropriate for that type.)
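
For the last suggestion, something like this tells g++ the type is 6 bytes wide but two-byte aligned (a sketch; __attribute__((packed, aligned(2))) is a GCC extension, so verify the static_asserts on your target):

#include <cstdint>

struct uint48
{
   unsigned long long v:48;
} __attribute__((packed, aligned(2)));   // 6 bytes, 2-byte rather than 1-byte aligned

static_assert(sizeof(uint48) == 6, "expected a 6-byte layout");
static_assert(alignof(uint48) == 2, "expected 2-byte alignment");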

Assume an x86_64 based architecture.

That is inadequate to gauge performance. There are a lot of different processor models in that architecture, and they can be configured differently in different systems, with different operating systems applying different settings to them.

Reducing it to 14 will remove any 16 byte alignment benefit if such a thing exists.

There are definitely 16-byte alignment benefits on some x86-64 processors. It is unknown, from the information in your question, whether you are using any of them. If you have diverse fields in a 16-byte structure, I doubt there is much benefit. (Uniform fields possibly could benefit from various SIMD features, like SSE2, AVX, and so on.)

Reducing the size of each entry increases the chance of wanted data being in a cache.

Cache behavior can often be improved more by careful consideration of use patterns and data layout and subsequent algorithm redesign based on those considerations than by slight decreases in the size of data used.

The cost of reducing the memory use is likely a reduction in performance but is there a way to estimate it that would justify performing the actual experiment?

Yes, if you have sufficient information about the data access patterns of your algorithm, the hardware you are using, and so on. No such information is present in the posted question.

Eric Postpischil
  • With regard to "Assume an x86_64 based architecture / That is inadequate to gauge performance": I don't have that control over the processors that clients happen to use. Currently it's left to the compiler to decide how to do things. I would be able to specify something like "must support SSE2" if it could be justified. I am of course free to give the compiler options deemed appropriate. – Bruce Adams Apr 30 '22 at 00:07
  • "Yes, if you have sufficient information". I guess this is a point of the question. What information would be sufficient? how do you even gather it. The original developer believed that the current algorithm made good use of caching but I don't believe know how it was determined or if it was actually measured. – Bruce Adams Apr 30 '22 at 00:10

I would consider an option worth exploring to be something like:

constexpr int N = 8;   // block size (see below)

struct Keys {
    uint32_t keys_msb_bits[N];
    uint16_t keys_lsb_bits[N];
};
struct Values {
    int32_t value1[N];
    int32_t value2[N];
};

With N==8, each struct would not be horribly aligned. The Values could each be aligned to a cache line boundary, while the Keys would be aligned to 16 bytes (suitable for ARM Neon, and for Intel up to SSE4.1).

It all depends of course on the memory access patterns of this data. The address computation is now a bit more complex (see the sketch below), but there is also an opportunity to search/replace using SIMD.
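
The address computation for entry i would look something like this (a sketch building on the Keys struct and N above; loadKey is a hypothetical helper):

#include <cstdint>
#include <cstddef>

// Reassemble the 48-bit key of global index i from the blocked layout.
uint64_t loadKey(const Keys* blocks, std::size_t i)
{
    const Keys& b = blocks[i / N];   // which block of N entries
    std::size_t lane = i % N;        // position within the block
    return ((uint64_t)b.keys_msb_bits[lane] << 16) | b.keys_lsb_bits[lane];
}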

Aki Suihkonen
  • This makes more sense if you link to https://stackoverflow.com/questions/3928995/how-do-cache-lines-work . Presumably you mean to use something like -fvectorise to enable SIMD? – Bruce Adams Apr 30 '22 at 07:17
  • SIMD is enabled already with -O3 or even -O2 in gcc. Rather this kind of memory layout serves first fast access through 64-byte cache lines, but it also serves loading different sized data to adjacent lanes in SIMD registers. No compiler will write the algorithm automatically. – Aki Suihkonen Apr 30 '22 at 15:20