3

I am trying to run a simple task (obtaining the x2APIC ID of the current processor) on every available hardware thread. I wrote the following code to do this, which works on the machines that I tested it on (see here for the complete MWE, compilable on Linux as C++11 code).

void print_x2apic_id()
{
        uint32_t r1, r2, r3, r4;
        std::tie(r1, r2, r3, r4) = cpuid(11, 0);
        std::cout << r4 << std::endl;
}

int main()
{
        const auto _ = std::ignore;
        auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        auto set = ::cpu_set_t{};
        std::cout << "Processors online: " << nprocs << std::endl;

        for (auto i = 0; i != nprocs; ++i) {
                CPU_SET(i, &set);
                check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
                CPU_CLR(i, &set);
                print_x2apic_id();
        }
}

Output on one machine (when compiled with g++, version 4.9.0):

0
2
4
6
32
34
36
38

Each iteration printed a different x2APIC ID, so things are working as expected. Now is where the problems start. I replaced the call to print_x2apic_id with the following code:

uint32_t r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

This causes the same ID to be printed for each iteration of the loop:

36
36
36
36
36
36
36
36

My guess as to what happened is that the compiler noticed that the call to cpuid does not depend on the loop iteration (even though it really does). The compiler then "optimized" the code by hoisting the call to CPUID outside the loop. To try to fix this, I converted r4 into an atomic:

std::atomic<uint32_t> r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

This failed to fix the problem. Surprisingly, this does fix the problem:

std::atomic<uint32_t> r1;
uint32_t r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

... Ok, now I'm confused.

Edit: Replacing the asm statement in the cpuid function with asm volatile also fixes the issue, but I don't see how this should be necessary.

My Questions

  1. Shouldn't inserting an acquire fence before the call to cpuid and a release fence after the call to CPUID be sufficient to prevent the compiler from performing the memory reordering?
  2. Why didn't converting r4 to std::atomic<uint32_t> work? And why did storing the first three outputs into r1, r2, and r3 instead of ignoring them cause the program to work?
  3. How can I correctly write the loop, using the least amount of synchronization necessary?
void-pointer
  • 14,247
  • 11
  • 43
  • 61

1 Answers1

3

I've reproduced the problem with optimization enabled (-O). You are right suspecting the compiler optimization. CPUID serves as full memory barrier (for the processor) itself; but it is the compiler which generates the code without calling to cpuid function in the loop since it threats it as a constant function. asm volatile prevents compiler from such an optimization saying that it has side-effects.

See this answer for details: https://stackoverflow.com/a/14449998/2527797

Community
  • 1
  • 1
Anton
  • 6,349
  • 1
  • 25
  • 53
  • I should have mentioned that I tried this already, but the problem still persists. – void-pointer Aug 18 '14 at 17:25
  • Replacing the `asm` statement in the `cpuid` function with `asm volatile` also fixes the issue. But I don't see how this should be necessary. – void-pointer Aug 18 '14 at 17:32
  • Ah! it means that you are rather right about the compiler optimization. I'll try to rework the answer. – Anton Aug 18 '14 at 17:34
  • Thanks, that makes sense. Here's a relevant quote from the answer: "The volatile keyword tells the compiler that it's not allowed to move this assembly block. For example, it may be hoisted out of a loop if the compiler decides that the input values are the same in every invocation. I'm not really sure under what conditions the compiler will decide that it understands enough about the assembly to try to optimize its placement, but the volatile keyword suppresses that entirely." – void-pointer Aug 18 '14 at 18:12
  • Actually, should the compiler be allowed to make this optimization? The values of the atomic variables are still used as parameters to the `cpuid` function. I would assume that the implicit fences would prevent the compiler from hoisting the `cpuid` instruction outside the loop. – void-pointer Aug 18 '14 at 18:14
  • 1
    probably compiler can prove that they are local and are not visible to other threads thus no side-effects. – Anton Aug 18 '14 at 18:28