How does Intel TBB's scalable_allocator work?

Question

What does the tbb::scalable_allocator in Intel Threading Building Blocks actually do under the hood?

It can certainly be effective. I've just used it to take 25% off an app's execution time (and seen CPU utilization rise from ~200% to 350% on a 4-core system) by changing a single std::vector<T> to std::vector<T,tbb::scalable_allocator<T> >. On the other hand, in another app I've seen it double an already large memory consumption and send things to swap city.

Intel's own documentation doesn't give a lot away (e.g. a short section at the end of this FAQ). Can anyone tell me what tricks it uses before I go and dig into its code myself?

UPDATE: I've just used TBB 3.0 for the first time and seen my best speedup from scalable_allocator yet. Changing a single vector<int> to a vector<int,scalable_allocator<int> > reduced the runtime of one program from 85s to 35s (Debian Lenny, Core2, with TBB 3.0 from testing).
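For anyone who wants to try the same swap, here is a minimal sketch of the one-line change. `USE_TBB_ALLOCATOR`, `Alloc`, `IntVec` and `fill_and_sum` are names made up for this illustration; the macro only exists so the snippet still builds without TBB installed, and the workload is a stand-in, not the program timed above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical build switch: define USE_TBB_ALLOCATOR and link with
// -ltbbmalloc to use TBB's allocator; otherwise fall back to std::allocator.
#ifdef USE_TBB_ALLOCATOR
#include <tbb/scalable_allocator.h>
template <typename T> using Alloc = tbb::scalable_allocator<T>;
#else
template <typename T> using Alloc = std::allocator<T>;
#endif

// The only difference from a plain std::vector<int> is the second
// template argument.
using IntVec = std::vector<int, Alloc<int>>;

// Stand-in workload (an assumption): grow a vector element by element,
// which forces repeated reallocation, then sum the contents.
long long fill_and_sum(std::size_t n) {
    IntVec v;
    for (std::size_t i = 0; i < n; ++i) v.push_back(static_cast<int>(i));
    assert(v.size() == n);
    return std::accumulate(v.begin(), v.end(), 0LL);
}
```

Whether this helps or hurts depends on the allocation pattern, so it is worth measuring both run time and memory use with and without the macro defined.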

score 21 · Accepted Answer · edited Jun 19 '16 at 01:34


There is a good paper on the allocator: The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks

My limited experience: I overloaded the global new/delete to use tbb::scalable_allocator in my AI application, but there was little change in the time profile. I didn't compare memory usage, though.
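A rough sketch of what such a global new/delete overload can look like, assuming the C entry points `scalable_malloc` and `scalable_free` declared in `tbb/scalable_allocator.h`. The `USE_TBB_ALLOCATOR` switch, `raw_alloc`, `raw_free` and the `g_new_calls` counter are hypothetical names added for illustration, and this is not necessarily how the application above did it.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical build switch: with USE_TBB_ALLOCATOR defined (and
// -ltbbmalloc on the link line) requests go to TBB; otherwise to malloc.
#ifdef USE_TBB_ALLOCATOR
#include <tbb/scalable_allocator.h>  // declares scalable_malloc / scalable_free
static void* raw_alloc(std::size_t n) { return scalable_malloc(n); }
static void raw_free(void* p) { scalable_free(p); }
#else
static void* raw_alloc(std::size_t n) { return std::malloc(n); }
static void raw_free(void* p) { std::free(p); }
#endif

// Illustration-only counter showing that the replacement is being used.
std::atomic<std::size_t> g_new_calls{0};

// Replacing these global functions reroutes every ordinary new/delete in
// the program, including the array forms, which forward here by default.
void* operator new(std::size_t n) {
    if (n == 0) n = 1;  // operator new must return a unique pointer
    g_new_calls.fetch_add(1, std::memory_order_relaxed);
    if (void* p = raw_alloc(n)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { raw_free(p); }
void operator delete(void* p, std::size_t) noexcept { raw_free(p); }
```

TBB also ships a tbbmalloc_proxy library intended to do this kind of replacement without source changes, which may be the easier route if you only want to test the effect.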

edited Jun 19 '16 at 01:34

Peter VARGA


answered Mar 19 '09 at 06:22

amit kumar


2

Thanks! Article contains exactly the sort of information I was looking for. – timday Mar 19 '09 at 09:17
3

The original link is now defunct, but CiteSeer has the PDF: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.8289 – Arto Bendiken Apr 04 '13 at 01:04
4

To add a datapoint: in my particular app, allocator contention halted speedup at around 15 threads; past that, adding threads erased the speedup, and by 40 threads it was much slower than single-threaded. With `scalable_allocator` used in the inner per-thread kernels, the bottleneck disappeared and the expected scaling came back (the machine has 40 physical cores). – Adam May 04 '14 at 06:39
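To illustrate why per-thread memory pools can remove this kind of contention, here is a toy thread-local free list. It is a generic sketch of the idea, not TBB's implementation, and every name in it (`ThreadCache`, `cache`, `local_alloc`, `local_free`, `kBlockSize`) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Toy fixed-size block cache, one instance per thread (illustration only).
constexpr std::size_t kBlockSize = 64;

struct FreeNode { FreeNode* next; };
static_assert(sizeof(FreeNode) <= kBlockSize, "block must hold a list node");

struct ThreadCache {
    FreeNode* head = nullptr;  // blocks freed by this thread, ready for reuse
    std::size_t reused = 0;    // how many requests were served from the cache
    ~ThreadCache() {           // hand cached blocks back when the thread exits
        while (head) { FreeNode* n = head; head = n->next; std::free(n); }
    }
};

inline ThreadCache& cache() {
    thread_local ThreadCache c;  // each thread gets its own list, so no lock
    return c;
}

void* local_alloc() {
    ThreadCache& c = cache();
    if (c.head) {                    // fast path: touches only thread-local state
        FreeNode* n = c.head;
        c.head = n->next;
        ++c.reused;
        return n;
    }
    return std::malloc(kBlockSize);  // slow path: falls back to the shared heap
}

void local_free(void* p) {
    if (!p) return;
    ThreadCache& c = cache();
    c.head = new (p) FreeNode{c.head};  // push onto this thread's list
}
```

Once a thread's cache is warm, its allocations never touch the shared heap, so threads stop serializing on one lock. The flip side is that each thread holds on to memory it has freed, and a production allocator also has to deal with blocks freed by a different thread than the one that allocated them, which this sketch ignores.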

score 3 · Answer 2 · answered Nov 05 '17 at 15:03

The solution you mentioned is optimized for Intel CPUs. It incorporates specific CPU mechanisms to improve performance.

Some time ago I found another very useful solution: Fast C++11 allocator for STL containers. It speeds up STL containers on VS2017 (~5x) as well as on GCC (~7x). It uses a memory pool for element allocation, which makes it extremely effective on all platforms.
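The linked allocator's code isn't reproduced here, so the following is only a generic sketch of the "memory pool for element allocation" idea as a minimal C++11 allocator. `PoolAllocator`, `Slot`, `Pool` and `kChunkSlots` are hypothetical names, and the sketch assumes single-threaded use and types that are not over-aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <new>
#include <vector>

// Minimal C++11 allocator: single-object requests come from a free list
// carved out of large chunks; array requests go straight to operator new.
template <typename T>
class PoolAllocator {
public:
    using value_type = T;

    PoolAllocator() noexcept = default;
    template <typename U>
    PoolAllocator(const PoolAllocator<U>&) noexcept {}  // needed for rebinding

    T* allocate(std::size_t n) {
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types are not supported by this sketch");
        if (n != 1) return static_cast<T*>(::operator new(n * sizeof(T)));
        Pool& p = pool();
        if (p.free_list == nullptr) p.grow();
        Slot* s = p.free_list;  // pop one slot off the free list
        p.free_list = s->next;
        return reinterpret_cast<T*>(s);
    }

    void deallocate(T* ptr, std::size_t n) noexcept {
        if (n != 1) { ::operator delete(ptr); return; }
        Slot* s = reinterpret_cast<Slot*>(ptr);
        Pool& p = pool();
        s->next = p.free_list;  // push the slot back for reuse
        p.free_list = s;
    }

private:
    static constexpr std::size_t kChunkSlots = 256;

    union Slot {
        Slot* next;                               // valid while the slot is free
        alignas(T) unsigned char bytes[sizeof(T)];  // storage while in use
    };

    struct Pool {
        Slot* free_list = nullptr;
        std::vector<Slot*> chunks;
        void grow() {
            chunks.push_back(nullptr);  // reserve the bookkeeping entry first
            Slot* chunk = static_cast<Slot*>(
                ::operator new(kChunkSlots * sizeof(Slot)));
            chunks.back() = chunk;
            for (std::size_t i = 0; i < kChunkSlots; ++i) {
                chunk[i].next = free_list;
                free_list = &chunk[i];
            }
        }
        ~Pool() { for (Slot* c : chunks) ::operator delete(c); }
    };

    // One pool per element type, shared by all allocator instances.
    static Pool& pool() { static Pool p; return p; }
};

template <typename T, typename U>
bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return false; }
```

Usage is the same one-argument swap as with any custom allocator, e.g. `std::list<int, PoolAllocator<int>> xs;`. Node-based containers such as std::list benefit most, since every element is a separate single-object allocation; the pool never returns memory to the system until program exit, which is part of the trade-off.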
