As another poster mentioned, creating an object with make_shared places the "control block" adjacent to the object it refers to, because both come out of a single allocation. In your case, however, I believe this to be a poor choice.
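For concreteness, here is a minimal sketch of the two allocation patterns being compared; `Widget` is just a placeholder type:

```cpp
#include <memory>

struct Widget { int x; };

int main() {
    // One allocation: the control block and the Widget are adjacent in memory.
    auto a = std::make_shared<Widget>();

    // Two allocations: the Widget lives wherever `new` put it, and the
    // control block is allocated separately, possibly far away.
    std::shared_ptr<Widget> b(new Widget());
}
```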
When you allocate memory, even in one big block, you have no guarantee of getting contiguous "physical space" as opposed to sparse, fragmented page allocations; and since each make_shared call is its own heap allocation, the elements of your list end up scattered across the address space. For this reason, iterating through your list causes reads across large spans of memory just to fetch the control structures (which then point to the data).
"But my cache lines are 64 bytes long!" you say. If this is true, you might think, "this will mean that the object is loaded into cache along with the control structure," but that is not necessarily true. That depends on many things such as data alignment, cache line size, associativity of the cache, and the actual memory bandwidth you use.
The problem you run into is that the control structure has to be fetched first just to find out where the data is. If you allocate all of the control structures together instead of with make_shared, they pack densely into a small number of cache lines, so at least that part of your data (the control structures) is practically guaranteed to be in cache while you iterate.
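One way to get that "allocate them all together" effect is std::allocate_shared with an arena, so every control block (and object) is carved out of one contiguous buffer. This is a minimal sketch, assuming a simple bump allocator that is never rewound is acceptable; `ArenaAllocator` is a made-up name:

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <vector>

// Hypothetical bump allocator: hands out memory sequentially from one
// buffer, so successive control blocks land next to each other.
template <typename T>
struct ArenaAllocator {
    using value_type = T;

    std::byte*   buffer;
    std::size_t  capacity;
    std::size_t* cursor;  // shared offset into the buffer

    ArenaAllocator(std::byte* buf, std::size_t cap, std::size_t* cur) noexcept
        : buffer(buf), capacity(cap), cursor(cur) {}

    template <typename U>  // rebind support, used internally by allocate_shared
    ArenaAllocator(const ArenaAllocator<U>& o) noexcept
        : buffer(o.buffer), capacity(o.capacity), cursor(o.cursor) {}

    T* allocate(std::size_t n) {
        // Bump the cursor, respecting T's alignment (alignof is a power of 2).
        std::size_t aligned = (*cursor + alignof(T) - 1) & ~(alignof(T) - 1);
        if (aligned + n * sizeof(T) > capacity) throw std::bad_alloc{};
        *cursor = aligned + n * sizeof(T);
        return reinterpret_cast<T*>(buffer + aligned);
    }
    void deallocate(T*, std::size_t) noexcept {}  // arena is released all at once
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return a.buffer == b.buffer;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return !(a == b);
}

int main() {
    alignas(std::max_align_t) static std::byte arena[1 << 16];
    std::size_t cursor = 0;
    ArenaAllocator<int> alloc(arena, sizeof(arena), &cursor);

    std::vector<std::shared_ptr<int>> ptrs;
    for (int i = 0; i < 100; ++i)
        // Control block and int live in one chunk of the arena; walking
        // `ptrs` now touches one dense region instead of the whole heap.
        ptrs.push_back(std::allocate_shared<int>(alloc, i));
}
```

Note the shared_ptrs still run destructors normally; only the memory reclamation is deferred until the arena itself goes away.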
If you want to make your data cache-friendly, make sure that all the references to it fit inside the smallest, fastest cache level possible. Using it repeatedly helps to keep it in cache. The hardware's prefetch and replacement logic is sophisticated enough to handle fetching your data, unless your code is very branch-heavy and the access pattern becomes hard to predict. That is the other part of making your data "cache friendly": use as few branches as possible when working on it.
Also, when working on the data, try to break it up into pieces that fit in cache. Operate on only 32 KB of it at a time if possible; that is a conservative L1 data cache size for modern processors. If you know exactly which CPU your code will run on, you can tune that figure less conservatively if you need to.
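Here is a minimal sketch of that blocking idea, assuming the 32 KB figure above and a made-up two-pass transform over a float buffer:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cache blocking: make several passes over a large buffer one ~32 KB tile
// at a time, so every pass after the first hits cache instead of RAM.
void process_in_tiles(std::vector<float>& data) {
    constexpr std::size_t kTileElems = (32 * 1024) / sizeof(float);  // assumed L1d size

    for (std::size_t base = 0; base < data.size(); base += kTileElems) {
        const std::size_t end = std::min(base + kTileElems, data.size());
        for (std::size_t i = base; i < end; ++i) data[i] *= 2.0f;  // pass 1 loads the tile
        for (std::size_t i = base; i < end; ++i) data[i] += 1.0f;  // pass 2 hits cache
    }
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);  // 4 MB: far bigger than L1
    process_in_tiles(data);
}
```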
EDIT: I forgot to mention a pertinent detail. The most common page size is 4 KB. Caches are usually set-associative, with limited associativity especially in lower-end processors. 2-way associative means that each location in memory can be cached in either of 2 possible cache slots; 4-way associative means it can fit into any of 4 possible slots, 8-way means any of 8, and so on. The higher the associativity, the better for you. The fastest cache (L1) on a processor tends to be the least associative, since that requires less control logic; having contiguous blocks of data to reference (such as contiguous control structures) is therefore a good thing. A fully associative cache is desirable.
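To make that concrete, here is the toy set-index arithmetic for a hypothetical 32 KB, 8-way cache with 64-byte lines (all three numbers are assumptions for illustration):

```cpp
#include <cstdint>
#include <cstdio>

// Every address maps to exactly one set, and a line can live in any of
// that set's ways. The parameters below are assumed, not queried.
int main() {
    const std::uint64_t line_size = 64;         // bytes per cache line
    const std::uint64_t ways      = 8;          // associativity
    const std::uint64_t cache_sz  = 32 * 1024;  // total cache bytes
    const std::uint64_t sets      = cache_sz / (line_size * ways);  // 64 sets

    const std::uint64_t addr = 0x12345678;
    const std::uint64_t set  = (addr / line_size) % sets;
    std::printf("address 0x%llx -> set %llu of %llu (any of %llu ways)\n",
                (unsigned long long)addr, (unsigned long long)set,
                (unsigned long long)sets, (unsigned long long)ways);
}
```

Any two addresses that land in the same set compete for its 8 ways, which is why dense, contiguous data makes better use of the sets than scattered allocations.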