Chandler Carruth introduced two functions in his CppCon 2015 talk that can be used for fine-grained inhibition of the optimizer. They are useful for writing micro-benchmarks that the optimizer won't simply nuke into meaninglessness.

void clobber() {
  asm volatile("" : : : "memory");
}

void escape(void* p) {
  asm volatile("" : : "g"(p) : "memory");
}    

These use inline assembly statements to change the assumptions of the optimizer.

The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer never looks inside an inline assembly block, and volatile keeps it from deleting the statement. It simply believes us when we tell it the code might read and write anywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes made before the call to clobber, and forces values to be re-read from memory after the call to clobber†.

The one in escape additionally makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code, that code can be empty and the optimizer will still assume that the block uses the address held in p. This effectively forces whatever p points to to be in memory and not in a register, because the assembly block might perform a read from that address.

(This is important because the clobber function won't force reads or writes for anything that the compiler decides to keep in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
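To make the register-versus-memory distinction concrete, here is a minimal sketch (GCC/Clang only). The two function names are hypothetical and exist only for illustration. Both functions compute the same sum, but one would expect only the second to keep x in memory across each barrier, because only there has the address of x been handed to escape.

```cpp
// GCC/Clang only: the two barriers from the question.
static void clobber() { asm volatile("" : : : "memory"); }
static void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }

// Hypothetical example: the address of x is never taken, so x may live
// entirely in a register. clobber() places no constraint on it.
int sum_in_register(int n) {
    int x = 0;
    for (int i = 0; i < n; ++i) {
        x += i;
        clobber();
    }
    return x;
}

// Hypothetical example: once &x has been passed to escape(), x counts as
// memory the asm block might touch, so each clobber() should force the
// pending store to x and a reload afterwards.
int sum_in_memory(int n) {
    int x = 0;
    escape(&x);
    for (int i = 0; i < n; ++i) {
        x += i;
        clobber();
    }
    return x;
}
```

Comparing the generated assembly of the two loops (for example with -O2 -S) is the easiest way to check what a particular compiler version actually does.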

All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
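For context, this is roughly how the two barriers get used in a micro-benchmark. It is only a sketch (GCC/Clang only): the timing helper time_push_back_ns is a made-up name, and a real benchmark would normally use a proper benchmarking library instead of a hand-rolled loop.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// GCC/Clang only: the two barriers from the question.
static void clobber() { asm volatile("" : : : "memory"); }
static void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }

// Hypothetical helper: times `iterations` rounds of reserve + push_back
// and returns the elapsed time in nanoseconds.
long long time_push_back_ns(std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());  // the buffer's address is now visible to the asm
        v.push_back(42);
        clobber();         // the store of 42 has to actually happen
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Without the two calls, the optimizer would be entitled to notice that v is never observed and drop the push_back entirely.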

These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?


† To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.

Peter Cordes
R. Martinho Fernandes
  • It looks [like](http://stackoverflow.com/a/8845503/786653) `_ReadWriteBarrier` might be the answer for `clobber`. I don't know about `escape` though. Maybe `_ReadWriteBarrier` plus handing off the pointer to some externally defined function. – user786653 Jul 07 '16 at 16:40
  • Oh, I forgot to mention another feature of these: they generate no code whatsoever. Any effect they have disappears after the optimizer is done. Nothing sticks until runtime. They're purely compile-time. – R. Martinho Fernandes Jul 07 '16 at 16:43
  • Like @user786653 said, `_ReadWriteBarrier` (or perhaps just `_ReadBarrier`/`_WriteBarrier` if that's all that is needed) will have the same effect in MSVC as `clobber`. For `escape`, my experience in analyzing assembly output is that MSVC will do the right thing if you just mark the variable `volatile`. Of course, there is some runtime overhead in that, because the generated code will *always* keep the variable updated in memory. It's not a perfect solution, but I haven't found anything better. – Cody Gray - on strike Jul 07 '16 at 16:44
  • Just as the compiler is allowed to reorder instructions for optimal performance, the CPU may perform instruction reordering - internally. To properly inform the processor of these fences, across which it may not migrate loads, stores or both operations, you need instructions in the code. Please research std::thread and fences in general. As the mechanism you describe only impacts the compiler ("no code whatsoever") and not the CPU, beware there be dragons. – David Thomas Jul 07 '16 at 16:47
  • @Cody `escape` works with arbitrary pointers, though, unlike volatile (e.g. `escape(somevector.data())`) – R. Martinho Fernandes Jul 07 '16 at 16:47
  • @David CPU reordering is fine: the real code runs with CPU reordering as well. What's not fine is the compiler taking away the things you care about and leaving you timing a whole bunch of nothing. This is meant to actually generate the loads/writes/etc in the first place. Threads are irrelevant here. – R. Martinho Fernandes Jul 07 '16 at 16:49
  • @R.MartinhoFernandes ... as long as you are using this mechanism for benchmarking, but not relying on it to prevent migration of reads/writes for threaded code, I [quiesce](http://www.thefreedictionary.com/Quiesce). – David Thomas Jul 07 '16 at 16:53
  • You should also be aware that "all memory" may not mean quite what you intend either. For instance local variables whose pointers haven't escaped might not be affected. – David Wohlferd Jul 07 '16 at 22:14
  • @DavidWohlferd I mentioned that in the post. – R. Martinho Fernandes Jul 08 '16 at 08:24
  • Have you researched std::atomic_signal_fence and atomic_thread_fence? – David Thomas Jul 11 '16 at 05:37
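Pulling the comment suggestions together, a clobber that picks a barrier per compiler might look like the sketch below. The names clobber_sketch and store_then_clobber are hypothetical. Some assumptions to be aware of: `_ReadWriteBarrier` comes from `<intrin.h>` and Microsoft's documentation marks it as deprecated, and `std::atomic_signal_fence` is only specified in terms of signal handlers, so treating it as a general compiler barrier is an assumption about the implementation rather than a guarantee. As noted in the comments, none of these is promised to affect locals whose address never escaped.

```cpp
#include <atomic>
#if defined(_MSC_VER)
#include <intrin.h>  // _ReadWriteBarrier
#endif

// Hypothetical wrapper: a compile-time-only barrier, chosen per compiler.
inline void clobber_sketch() {
#if defined(_MSC_VER)
    _ReadWriteBarrier();                  // MSVC intrinsic (deprecated)
#elif defined(__GNUC__) || defined(__clang__)
    asm volatile("" : : : "memory");      // the original clobber()
#else
    // Fallback assumption: acts as a compiler-only fence on most toolchains.
    std::atomic_signal_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical usage: the store should be emitted before the barrier and
// the load re-issued after it, since buffer points to caller-visible memory.
inline int store_then_clobber(int* buffer, int value) {
    buffer[0] = value;
    clobber_sketch();
    return buffer[0];
}
```

This only covers `clobber`. It does nothing for the `escape` half of the question.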

2 Answers

Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):

// always returns false, but in an undeducible way
bool isClobberingEnabled();

// The challenge is to implement this function in a way
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// the stack or static memory.
volatile char* nextLocationToClobber();

const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;

inline void clobber() {
    if ( clobberingIsEnabled ) {
        // This will never be executed, but the compiler
        // cannot know about it.
        clobberingPtr = nextLocationToClobber();
        *clobberingPtr = *clobberingPtr;
    }
}

UPDATE

Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?

Answer: We can take advantage of a hard-to-prove property from number theory, for example Fermat's Last Theorem:

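#include <cstdint>  // std::uint32_t, std::uintptr_t
#include <cstdlib>  // std::rand
#include <ctime>    // std::clock

bool undeducible_false() {
    // It took mathematicians more than 3 centuries to prove Fermat's
    // Last Theorem in its most general form. That knowledge has hardly
    // been put into compilers (nor will the compiler try hard enough
    // to check all one million possible combinations below).

    // Caveat: avoid integer overflow (Fermat's theorem
    //         doesn't hold for modulo arithmetic)
    // Converting to unsigned first keeps a in 1..100 even if
    // std::clock() fails and returns -1.
    std::uint32_t a = static_cast<std::uint32_t>(std::clock()) % 100 + 1;
    std::uint32_t b = std::rand() % 100 + 1;
    std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;

    // a^3 + b^3 == c^3 has no solution in positive integers.
    return a*a*a + b*b*b == c*c*c;
}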
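Putting the pieces of this answer together, a complete draft could look like the following. This is a sketch under loud assumptions: the body of nextLocationToClobber is a hypothetical stand-in that just turns a random number into a pointer, and it does not solve the real challenge of convincing the optimizer that the pointer can land anywhere. Unlike the GCC version, this compiles to an actual test and branch on a global.

```cpp
#include <cstdint>
#include <cstdlib>
#include <ctime>

// Always false: a^3 + b^3 == c^3 has no solution in positive integers,
// and with a, b, c in 1..100 the cubes cannot overflow 32 bits.
bool undeducible_false() {
    std::uint32_t a = static_cast<std::uint32_t>(std::clock()) % 100 + 1;
    std::uint32_t b = static_cast<std::uint32_t>(std::rand()) % 100 + 1;
    std::uint32_t c = static_cast<std::uint32_t>(
        reinterpret_cast<std::uintptr_t>(&a) % 100 + 1);
    return a * a * a + b * b * b == c * c * c;
}

// HYPOTHETICAL stand-in, never actually called at runtime. A real
// implementation would need to look like it can return any address.
volatile char* nextLocationToClobber() {
    return reinterpret_cast<volatile char*>(
        static_cast<std::uintptr_t>(std::rand()));
}

const bool clobberingIsEnabled = undeducible_false();
volatile char* clobberingPtr;

inline void clobber() {
    if (clobberingIsEnabled) {
        // Never executed, but the compiler is not expected to prove that.
        clobberingPtr = nextLocationToClobber();
        *clobberingPtr = *clobberingPtr;
    }
}
```

Whether a given optimizer really treats the dead branch as a write to arbitrary memory is something to verify in the generated assembly, not something this sketch can promise.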
Community
Leon
  • @Peter note that `isClobberingEnabled` is called only once (it's used in namespace scope). However, maybe your point still applies to `nextLocationToClobber`. – R. Martinho Fernandes Jul 08 '16 at 16:27
  • @R.MartinhoFernandes: Just noticed that and deleted my comment. Reposting a correct version: The `call` to `nextLocationToClobber` means the compiler can't treat the function containing it as a leaf function. Hopefully spilling of call-clobbered registers would be limited to the branch where the call happens, and not have too much impact on the not-taken side, but it's still non-zero impact. It will compile to a test&branch on a global, at least. So there's a non-zero amount of code generated, unlike for gcc. :/ Still, a predictable branch is cheap. – Peter Cordes Jul 08 '16 at 16:27
  • This might be the best you can do with MSVC, but that would be disappointing if it didn't have any builtin / intrinsic functions that can help. – Peter Cordes Jul 08 '16 at 16:34
  • How would you ensure that `isClobberingEnabled` returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind? – Cody Gray - on strike Jul 09 '16 at 05:53
I have used the following in place of escape.

#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
    *reinterpret_cast<char volatile*>(p) =
        *reinterpret_cast<char const volatile*>(p); // thanks, @milleniumbug
}
#pragma optimize("", on)
#endif

It's not perfect but it's close enough, I think.

Sadly, I don't have a way to emulate clobber.
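For completeness, here is a sketch of how the MSVC version above could sit next to the GCC/Clang one behind a single name, with a small usage example (first_element_after_escape is a hypothetical name). One difference worth keeping in mind: the MSVC variant really reads and writes one byte through p, so unlike the inline-assembly version it generates code, and p has to point to at least one readable and writable byte (no null pointers, no pointers to const data).

```cpp
#include <vector>

#if defined(_MSC_VER)
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
    // Volatile read-then-write of the first byte that p points to.
    *reinterpret_cast<char volatile*>(p) =
        *reinterpret_cast<char const volatile*>(p);
}
#pragma optimize("", on)
#else
// GCC/Clang: the zero-cost version from the question.
inline void escape(void* p) { asm volatile("" : : "g"(p) : "memory"); }
#endif

// Hypothetical usage: the vector's buffer must really be filled in,
// because its address has been made visible through escape().
inline int first_element_after_escape() {
    std::vector<int> v(4, 7);
    escape(v.data());
    return v[0];
}
```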

R. Martinho Fernandes