
Consider a multi-core ARM processor. One thread modifies a machine code block that may be executed concurrently by another thread. The modifying thread makes two kinds of changes:

  1. Mark the machine code block for skipping: it places a jump instruction as the first instruction of the code block, so that whoever executes it skips the rest of the instructions, jumping over the whole code block.
  2. Mark the machine code block for execution: it writes the rest of the instructions, starting from the second, then it atomically replaces the first instruction (the jump) with the intended first instruction of the code block.

For the code-writer thread, I understand that it is enough to make the final write with std::memory_order_release in C++11.

However, it is not clear what to do on the executor thread's side (that code is out of our control; we only control the machine code block we write). Should we place some instruction barrier before the first instruction of the code block being modified?

Serge Rogatch
  • Why are you self-modifying code, or modifying your partner's code? Why doesn't this use one of the normal locking solutions you would use for any other shared resource? – old_timer Sep 02 '16 at 14:51
  • @dwelch, it is related to compiler tooling: a hot-instrumentation mechanism. It should be possible to turn instrumentation on and off at run time, so that instrumentation doesn't slow the code down when it is off. – Serge Rogatch Sep 02 '16 at 14:53
  • It is just a shared resource, some common RAM used by two threads; normal practices apply. Nothing special here. – old_timer Sep 02 '16 at 14:55
  • You obviously cannot go into that block and modify the running code while it is running; you don't have enough information about the running logic to do that in a clean manner, so you have to treat it as a shared resource and only touch it when the other thread is not using it. – old_timer Sep 02 '16 at 14:56
  • You will have to flush the i-cache; there is no standard way to do this, but it depends on the operating system used. E.g. under Linux/GCC you can call `__builtin___clear_cache`. – ensc Sep 02 '16 at 14:57
  • @dwelch, as I understand it, the special thing is that the memory contains code, not data. – Serge Rogatch Sep 02 '16 at 15:00
  • The ARM Architecture Reference Manual has a number of relevant sections. See *A3.5.4 Concurrent modification and execution of instructions*, and possibly *B2.2.9 Ordering of cache and branch predictor maintenance operations*. – EOF Sep 02 '16 at 15:09
  • @ensc: even having the consumer threads do that before entering every block isn't enough to make this safe. – Peter Cordes Sep 03 '16 at 20:36

1 Answer


I don't think your update procedure is safe. Unlike x86, ARM's instruction caches aren't coherent with data caches, according to this self-modifying-code blog post.

The non-jump first instruction could still be cached, so another thread could enter the block. When execution reaches the second i-cache line of the block, that line might be re-loaded and see the partially-modified state.

There's also another problem: an interrupt (or context switch) could lead to an evict/reload of the cache line in a thread that's still in the middle of executing the old version. Rewriting a block of instructions in-place requires you to be sure that execution in all other threads has exited the block after you modify things so that new threads won't enter it. This is an issue even with coherent I-cache (like x86), and even if the block of code fits in a single cache line.

I don't think there's any way to make rewriting in-place both safe and efficient at the same time on ARM.

Without coherent I-caches, you also can't guarantee that other threads will see code changes promptly with this design, without ridiculously expensive things like flushing blocks from L1I cache before running them every time.

With a coherent I-cache (x86 style), you could just wait long enough for any possible delay in another thread finishing execution of the old version. Even if the block doesn't do any I/O or system calls, cache misses and context switches are possible. If it's running at realtime priority, especially with interrupts disabled, then the worst case is just cache misses, i.e. not very long. Otherwise I wouldn't bet on anything less than a timeslice or two (maybe 10ms) being really safe.


These slides have a nice overview of ARM caches, mostly focusing on ARMv8.

I'm actually going to quote another slide (about virtualizing ARM) for this bullet point summary, but I'd recommend reading the ELC2016 slides, not the virtualization slides.

Software needs to be aware of caches in a few cases, one being executable code loading / generation:

  • Requires a D-cache clean to Point of Unification + I-cache invalidation
  • Possible from userspace on ARMv8
  • Requires a system call on ARMv7

D-cache can be invalidated with or without write-back (so make sure you clean/flush instead of discard!). You can and should trigger this by virtual address (instead of flushing a whole cache at once, and definitely don't use the flush by set/way stuff for this).

If you didn't clean your D-cache before invalidating I-cache, code-fetch could fetch directly from main memory into non-coherent I-cache after missing in L2. (Without allocating a stale line in any unified caches, which MESI would prevent because L1D has the line in Modified state). In any case, cleaning L1D to the PoU is architecturally required, and happens in the non-perf-critical writer thread anyway, so it's probably best just to do it instead of trying to reason whether it's safe not to for a specific ARM microarchitecture. See comments for @Notlikethat's efforts to clear up my confusion on this.

For more on clearing I-cache from user space, see How clear and invalidate ARM v7 processor cache from User Mode on Linux 2.6.35. GCC's __clear_cache() function and Linux's sys_cacheflush only work on memory regions that were mmapped with PROT_EXEC.


Don't modify in-place: use a new location

Where you were planning to have whole blocks of instrumentation code, put a single indirect jump (or a save/restore of lr and a function-call if you're going to have a branch anyway). Each block has its own jump target variable which can be updated atomically. The key thing here is that the destination for the indirect jump is data, so it's coherent with stores from the writing thread.

Since you update the pointer atomically, consumer threads either jump to the old or new block of code.

Now your problem is making sure that no core has a stale copy of the new location in its i-cache. Given the possibilities of context switches, that includes the current core, if context switches don't totally flush the i-cache.

If you use a large enough ring buffer of locations for new blocks, such that they sit unused for long enough to be evicted, it might be impossible in practice for there to ever be a problem. This sounds incredibly hard to prove, though.

If updates are infrequent compared to how often other threads run these dynamically-modified blocks, it's probably cheap enough to have the publishing thread trigger cache-flushes in other threads after writing a new block, but before updating the indirect-jump pointer to point to it.


Forcing other threads to flush their cache:

Linux 4.3 and later has a membarrier() system call that will run a memory barrier on all other cores in the system (usually with an inter-processor interrupt) before it returns (thus barriering all threads of all processes). See also this blog post describing some use-cases (like user-space RCU) and mprotect() as an alternative.

It doesn't appear to support flushing instruction caches, though. If you're building a custom kernel, you could consider adding support for a new cmd or flag value that means flush instruction caches instead of (or as well as) running a memory barrier. Perhaps the flag value could be a virtual address? This would only work on architectures where an address fits in an int, unless you tweak the system call API to look at the full register width of flag for your new cmd, but only the int value for the existing MEMBARRIER_CMD_SHARED.


Other than hacking membarrier(), you could send signals to the consumer threads, and have their signal handlers flush an appropriate region of i-cache. That's asynchronous, so the producer thread doesn't know when it's safe to reuse the old block.

IDK if munmap()ing it would work, but it's probably more expensive than necessary (because it has to modify page tables and invalidate the relevant TLB entries).


Other strategies

You might be able to do something by publishing a monotonically-increasing sequence number in a shared variable (with release semantics so it's ordered wrt. instruction writes). Then consumer threads check the sequence number against a thread-local highest-seen, and invalidate i-cache if there's new stuff. This could be per-block or global.

This doesn't directly solve the problem of detecting when the last thread running an old block has left it, unless those per-thread highest-seen counters aren't purely thread-local: still written per-thread, but readable by the producer thread. The producer can scan them for the lowest sequence number in any thread, and if that's higher than the sequence number at which a block became unreferenced, the block can be reused. Be careful of false sharing: don't use a global array of unsigned long for it, because you want each thread's private variable to be in a separate cache line from other threads' variables.


Another possible technique: if there's only one consumer thread, the producer sets the jump target pointer to point to a block which doesn't change (so doesn't need to be i-cache flushed). That block (which runs in the consumer thread) executes a cache-flush for the appropriate line of i-cache and then modifies the jump-target pointer again, this time to point to the block that should be run every time.

With multiple consumer threads, this gets a bit clunky: maybe each consumer has its own private jump-target pointer and the producer updates all of them?

Peter Cordes
    _"The slides don't make it clear why you need to flush L1D, because it should be coherent with L2"_ - ugh, that's not a good example, as it's from the point of view of a hypervisor with levels of cache the guest doesn't know about, and the guest doing set/way operations (which aren't applicable to an SMP situation). Cleaning "the D-cache" by VA to the PoU (which affects as many levels as it needs to) is all you need to care about here. FWIW, [ELC had a more relevant presentation this year](http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf). – Notlikethat Sep 03 '16 at 23:02
  • @Notlikethat thanks for the link. But I still don't understand why cleaning "the D-cache" by VA to the PoU has to be done manually / explicitly. If you use appropriate barriers to make sure the stores are committed to L1D (not just the store buffer) before flushing I-cache, then where's the problem? The PoU unified cache is coherent with all other data caches in the whole CPU, right? So when L1I tries to read from it, it must eventually get the data that's currently in a Modified line in L1D (of this or another core). Is explicit cleaning just a way to achieve the barrier within one thread? – Peter Cordes Sep 04 '16 at 07:22
  • In this case, where we write the data in one thread, and only flush i-cache + execute the new code in another thread, I think we're extra safe. If the consumer thread uses an acquire load on the pointer or sequence counter, seeing a pointer to a block means the updated block contents are also visible to it (assuming the producer used a release-store in the right order). So code-fetch after an i-cache invalidate should see the update just like a data load would? Or is this where my reasoning has gone wrong, and unified caches might re-fetch from memory to satisfy an i-cache miss? – Peter Cordes Sep 04 '16 at 07:31
  • Forgot to say: the one case where I understand having to clean the D$ to the PoU is when there is no unified cache between D$ and memory. In that case, I$ will read from memory, which isn't coherent with D$. The new set of slides again say "Requires prior D$ maintenance to PoU or PoC!" before running `IC IALLU`, again without explaining why. Hmm, I see for `IC IVAU` (invalidate I$ by VA), it doesn't say you need to have cleaned D$. I'm not sure I understand how Shareability Domains fit in, but let's assume the addresses are in a region set to outer-shareable, if that makes sense. – Peter Cordes Sep 04 '16 at 07:51
  • Data caches are only coherent with each other _with respect to data accesses_; an L1I miss may look up in a unified L2, but that won't cause L2 to go up and snoop L1D - one of the primary reasons to _have_ non-coherent I-caches (in the uniprocessor situation they evolved from) is the simplicity and power saving of not having to have such a snoop mechanism. – Notlikethat Sep 04 '16 at 20:15
  • @Notlikethat: oh, so L2 could fill from memory. I guess the other core sharing the same L2 could also get the old copy into its L1D, unless an acquire-load or barrier prevented that. But actual coherence violation would be prevented by snooping if it wanted to change that line to Modified. Ok, I think this all makes sense, and that explains the need to clean to PoU. TYVM for figuring out that I was missing the point that L2 could fill from memory even when L1D was dirty. :) – Peter Cordes Sep 04 '16 at 20:24
  • That doesn't happen on modern Intel designs, because the shared L3 is inclusive. It's a one-stop shop for cache coherency: L3 tags tell you if a line is cached (and potentially dirty) anywhere on-chip. This means a lot of coherency traffic stays on-chip instead of going to memory and back. Anyway, this special-case of cache design is why I was thinking L2 would notice L1D, and that once a store was committed to L1D, all other cores see that data. I still think that's true on Intel HW, but I now realize it's not because of MESI alone, but rather because of how it's implemented. – Peter Cordes Sep 04 '16 at 20:34
  • ...or it may not even fill at all - e.g. on Cortex-A7 _all_ linefetches allocate directly into the respective L1, while the unified L2 only allocates evictions from L1D. Caches be crazy. – Notlikethat Sep 04 '16 at 20:38
  • @Notlikethat: I think that last point (L1I fetching directly from memory, using L2 only as a victim cache) is the real explanation. I just re-checked the [MESI article](https://en.wikipedia.org/wiki/MESI_protocol) and my reasoning wasn't Intel-centric after all: If L1D has a Modified line, all other coherent caches (including L2) aren't allowed to hold a copy of the line. So L2 wouldn't be allowed to allocate a stale copy from memory; only a non-coherent cache like an L1I could do that. But it will check for an L2 hit before going to memory, and that's where it can pick up the modified line. – Peter Cordes Sep 04 '16 at 22:06
  • At least, assuming ARM caches actually follow some variation of the MESI rules. – Peter Cordes Sep 04 '16 at 22:06