I'm attempting to create a std::vector<std::set<int>>
with one set for each NUMA-node, containing the thread-ids obtained using omp_get_thread_num()
.
Idea:
- Create data which is larger than L3 cache,
- set first touch using thread 0,
- perform multiple experiments to determine the minimum access time of each thread,
- extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double>(new double[size]);
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#ifdef DECIDE_THREAD_COUNT
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#else
#pragma omp parallel
#endif
{
// perform first touch using thread 0
const auto thread_id = omp_get_thread_num();
if (thread_id == 0)
{
thread_count = omp_get_num_threads();
for (auto index = std::size_t{}; index < size; ++index)
{
container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
}
}
#pragma omp barrier
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// calculate the minimum access time of this thread
auto this_thread_min_duration = std::numeric_limits<double>::max();
for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
{
const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
const auto start_timestamp = omp_get_wtime();
for (auto index = std::size_t{}; index < part_size; ++index)
{
static volatile auto exceedingly_interesting_value_wink_wink = data[index];
}
const auto end_timestamp = omp_get_wtime();
const auto duration = end_timestamp - start_timestamp;
if (duration < this_thread_min_duration)
{
this_thread_min_duration = duration;
}
}
#pragma omp critical
{
thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
}
}
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Env. and Output
- How do
OMP_PLACES
andOMP_PROC_BIND
work?
I am attempting to not use SMT by using export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24
. However, I'm getting this output:
What's puzzling me is that I'm having the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to neatly see 12 threads with access time, say, x
and another 12 with access time ~2x
.
- Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48
Any help in understanding this phenomenon would be much appreciated.