I am trying to run a simple task (obtaining the x2APIC ID of the current processor) on every available hardware thread. I wrote the following code to do this, which works on the machines that I tested it on (see here for the complete MWE, compilable on Linux as C++11 code).
void print_x2apic_id()
{
uint32_t r1, r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
}
int main()
{
const auto _ = std::ignore;
auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
auto set = ::cpu_set_t{};
std::cout << "Processors online: " << nprocs << std::endl;
for (auto i = 0; i != nprocs; ++i) {
CPU_SET(i, &set);
check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
CPU_CLR(i, &set);
print_x2apic_id();
}
}
Output on one machine (when compiled with g++, version 4.9.0):
0
2
4
6
32
34
36
38
Each iteration printed a different x2APIC ID, so things are working as expected. Now is where the problems start. I replaced the call to print_x2apic_id
with the following code:
uint32_t r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
This causes the same ID to be printed for each iteration of the loop:
36
36
36
36
36
36
36
36
My guess as to what happened is that the compiler noticed that the call to cpuid
does not depend on the loop iteration (even though it really does). The compiler then "optimized" the code by hoisting the call to CPUID outside the loop. To try to fix this, I converted r4
into an atomic:
std::atomic<uint32_t> r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
This failed to fix the problem. Surprisingly, this does fix the problem:
std::atomic<uint32_t> r1;
uint32_t r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;
... Ok, now I'm confused.
Edit: Replacing the asm
statement in the cpuid
function with asm volatile
also fixes the issue, but I don't see how this should be necessary.
My Questions
- Shouldn't inserting an acquire fence before the call to
cpuid
and a release fence after the call to CPUID be sufficient to prevent the compiler from performing the memory reordering? - Why didn't converting
r4
tostd::atomic<uint32_t>
work? And why did storing the first three outputs intor1
,r2
, andr3
instead of ignoring them cause the program to work? - How can I correctly write the loop, using the least amount of synchronization necessary?