3

I'm writing a delegate library that is supposed to have absolutely no overhead. Therefore it's important that accessing a stored function pointer is as fast as possible.

So my question is: does access speed depend on a member's position within a class? I've heard that the most important member should come first in the member declarations, and that makes sense to me, because then the this pointer of the class points to the same address as that member (assuming a non-virtual class). Whereas if the important member were at any other position, the CPU would have to compute its address by adding an offset from the class layout to this.

On the other hand, I know that the compiler represents such an access as a qword ptr operand, which already encodes the offset information.

So my question comes down to: does resolving a qword ptr take constant time, or does it get slower when the offset is not 0? And does the behaviour stay the same on different platforms?

Peter Cordes
Jack Sabbath
    Most common architectures can add small displacements at no runtime cost although for example on x86 you need longer machine code. – Jester Mar 23 '20 at 01:24
  • Yes it matters because if your class members occupy two cachelines rather than one you are effectively doubling your working set – user997112 Apr 29 '20 at 06:05

2 Answers

6

Most machines have a load instruction or addressing mode that can include a small constant displacement for no extra cost.

On x86, [reg] vs. [reg + disp8] costs 1 extra byte for the 8-bit displacement part of the addressing mode. On RISC-like machines, e.g. ARM, fixed-width instructions mean that load/store instructions always have some bits for a displacement (which can simply be all zero to access the first member given a pointer to the start of the object).


Group the hottest members together at the front of the class, preferably sorted by size to avoid gaps for padding (How do I organize members in a struct to waste the least space on alignment?) Hopefully the hot members will all be in the same cache line. (If your class/struct extends into a 2nd cache line, hopefully only the first line has to stay hot in cache most of the time, reducing the footprint of your working set.)

If the member isn't in the same page as the start of the object, Sandybridge-family's pointer-chasing optimization can cause extra latency if this was also loaded from memory. Is there a penalty when base+offset is in a different page than the base? Normally it reduces the L1d load-use latency from 5 to 4 cycles for addressing modes like [rdi + 0..2047] by optimistically using just the register value as an input to the TLB, but has to retry if it guessed wrong. (Not a pipeline flush, just retrying that load uop without the shortcut.)


Note that function pointers mostly depend on branch prediction to be efficient, with access latency only mattering to check the prediction (and start branch recovery if it was wrong). i.e. speculative execution + branch prediction hides the latency of control dependencies in CPUs with out-of-order exec.

Peter Cordes
  • Amazing answer! If I understand it correctly, the cache line doesn't have to align with my struct, so if I had 2 pointers in my struct, accessing the 2nd one might already cause a cache miss, is that correct? If so, wouldn't increasing the alignment of my struct guarantee that bigger parts of my struct are in the same cache line? – Jack Sabbath Mar 23 '20 at 15:27
  • 1
    @JackSabbath: Right. On a 64-bit machine, dynamic allocations are usually at least 16-byte aligned so the first 2 members tend to be aligned. But in static and automatic storage it might only be aligned by its minimum `alignof()` which is probably 8 unless you give the first member more alignment. (Don't do that if it's virtual, though; that would lead to padding between the vtable pointer and the first member). It can be inconvenient to guarantee more than `alignof(max_align_t)` even in C++17 so **I wouldn't do anything complicated unless profiling actually shows lots of cache misses.** – Peter Cordes Mar 23 '20 at 20:53
2

The order of class members may affect performance, but usually not because of the offsets. As mentioned above, almost all architectures have load/store instructions with an offset. For small structs you need 1 more byte on x86 and 0 more bytes on fixed-width ISAs (and even with that extra byte the x86 instruction is usually still shorter than the fixed 4-byte instructions of those ISAs). If the struct is huge you may need 4 more bytes for a 32-bit displacement on x86-64, but it's still a single instruction; on fixed-width ISAs you'd need at least one extra instruction to materialize a 32-bit displacement. Either way, the cost of the offset calculation is tiny compared to cache misses, which are the main thing that can degrade performance when you move members around.

So the order of class members affects where the fields land in cache, and you'll want the important members to be in cache, ideally in the same cache line. Sorting members by decreasing size avoids padding, but if the hottest members are small, it can still be worth moving them to the front, as long as doing so doesn't introduce padding. For example

struct mystruct
{
    uint32_t extremely_hot;
    uint8_t  very_hot[4];
    void* ptr;
};

If ptr isn't accessed very often, it's probably better to keep it after the hotter fields like this.

But moving fields around isn't always the best solution. In many cases you may want to split the class in two: one for hot members and one for cold ones. In fact, I've read somewhere that the Intel compiler has a feature that automatically splits hot and cold members of a class into separate classes when running profile-guided optimization. Unfortunately I can't find the source right now.

Take a simple example:

struct game_player
{
    int32_t id;
    int16_t positionX;
    int16_t positionY;
    int16_t health;
    int16_t attribute;
    game_player* spouse;
    time_t join_time;
};
game_player players[MAX_PLAYERS];

Only the first 5 fields are commonly used when rendering the object on screen, so we can split them out into a hot struct:

struct game_player_hot
{
    int32_t id;
    int16_t positionX;
    int16_t positionY;
    int16_t health;
    int16_t attribute;
};
struct game_player_cold
{
    game_player_cold* spouse;
    time_t join_time;
};
game_player_hot players_hot[MAX_PLAYERS];
game_player_cold players_cold[MAX_PLAYERS];

Sometimes it's recommended to use SoA (struct of arrays) instead of AoS (array of structs), or a mix of the two, when the same field of many different objects is accessed at the same time or in the same manner. For example, if we have a list of vectors to sum, instead of

struct my_vector
{
    uint16_t x, y, z;
    uint16_t attribute; // very rarely used
};
my_vector vectors[MAX];

we'll use

struct my_vector
{
    uint16_t x[MAX];    // hot
    uint16_t y[MAX];    // hot
    uint16_t z[MAX];    // hot
    uint16_t attribute[MAX];
};

That way all the coordinate values are kept hot and close to each other, and the layout also makes vectorization easier and more effective.


phuclv