
Following up on an earlier question of mine. I'm writing, testing and benchmarking code on a MacBook Air with the M1 CPU running macOS 13.2.

I implemented the code generation approach I suggested in my question and got all tests working, compared to a "conventional" (no code generation) approach to the same problem. As usual, I had to enable writes to executable pages using pthread_jit_write_protect_np(0) prior to generating the code, followed by write-protecting the pages again using pthread_jit_write_protect_np(1), and then call sys_icache_invalidate() prior to running the generated code, due to cache coherency issues between the L1 I- and D-caches.
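For concreteness, here is a minimal sketch of that workflow (error handling omitted; the emitted `ret` is a placeholder, not my actual generated code, and under the hardened runtime this also requires the com.apple.security.cs.allow-jit entitlement):

```c
#include <sys/mman.h>
#include <pthread.h>
#include <libkern/OSCacheControl.h>
#include <stdint.h>

int main(void) {
    size_t len = 4096;
    /* JIT pages must be mapped with MAP_JIT on Apple Silicon. */
    uint32_t *code = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);

    pthread_jit_write_protect_np(0);   /* make pages writable (RW) for this thread */
    code[0] = 0xD65F03C0;              /* AArch64 "ret" */
    pthread_jit_write_protect_np(1);   /* back to executable (RX) */
    sys_icache_invalidate(code, len);  /* make the I-cache see the new code */

    ((void (*)(void))code)();          /* run the generated code */
    return 0;
}
```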

If I run the full code with the call to sys_icache_invalidate() commented out, it takes a few hundred nanoseconds, which is quite competitive with the conventional approach. This is after a few straightforward optimizations, and I'm confident that with further work I could beat the conventional approach.

However, the code of course doesn't work with sys_icache_invalidate() commented out. Once I add it back and benchmark, it's adding almost 3 µs to the execution time. This makes the codegen approach hopelessly slower than the conventional approach.

Looking at Apple's code for sys_icache_invalidate(), it seems simple enough: for each cache line, with the starting address in a register xN, it runs ic ivau, xN. Afterwards, it runs dsb ish and isb. It occurred to me that I could run ic ivau, xN after each cache line is generated in my codegen function, and then dsb ish and isb at the end. My thought is that perhaps each ic ivau, xN instruction could run in parallel with the rest of the codegen.
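In code, the interleaved variant I tried looks roughly like this (emit_cache_line() is a hypothetical stand-in for one iteration of my codegen, and the 64-byte line size is an assumption; portable code should read it from CTR_EL0):

```c
extern void emit_cache_line(char *line);  /* hypothetical: emits 64 bytes of code */

static void codegen_interleaved(char *buf, size_t nlines) {
    for (size_t i = 0; i < nlines; i++) {
        char *line = buf + i * 64;
        emit_cache_line(line);
        /* queue an I-cache invalidate for this line right away */
        __asm__ volatile("ic ivau, %0" :: "r"(line) : "memory");
    }
    /* wait for all invalidates to complete, then resynchronize fetch */
    __asm__ volatile("dsb ish\n\tisb" ::: "memory");
}
```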

Unfortunately, the code still failed, and moreover, this only shaved a couple hundred ns off the execution time. I then decided to add a call to pthread_jit_write_protect_np(1) before each ic ivau, xN, followed by a call to pthread_jit_write_protect_np(0), which finally fixed the code. At that point it added a further 5 µs to the execution time, which renders the approach completely infeasible. Scratch that: I made a mistake, and even with the calls to pthread_jit_write_protect_np(), I simply can't get it to work unless I call Apple's sys_icache_invalidate().

While I've made peace with the fact that I will need to abandon the codegen approach, I just wanted to be sure:

  1. Does ic ivau, xN "block" the M1, i.e. prevent other instructions from executing in parallel, or perhaps flush the pipeline?
  2. Does ic ivau, xN really not work if the page is writeable? Or perhaps pthread_jit_write_protect_np() is doing some other black magic under the hood, unrelated to its main task of write-protecting the page, that I could also do without actually write-protecting the page? For reference, here is the source to Apple's pthread library, but it essentially calls os_thread_self_restrict_rwx_to_rx() or os_thread_self_restrict_rwx_to_rw(), which I assume are Apple-internal functions whose source I was unable to locate.
  3. Is there some other approach to cache line invalidation to reduce this overhead?
swineone
  • Related: https://stackoverflow.com/questions/70635862/synchronizing-caches-for-jit-self-modifying-code-on-arm/70684882#70684882 – Nate Eldredge Feb 06 '23 at 06:21
  • You have to flush ("clean") the data cache, and `dsb ish` to wait for it to finish, before invalidating the instruction cache. Did you do that? – Nate Eldredge Feb 06 '23 at 06:22
  • As detailed in my answer I linked above, the sequence is: (1) write your code to memory (2) `dc cvau` every cache line, which queues up a flush of data cache (3) `dsb ish` to wait for all the queued flushes to finish (4) `ic ivau` every cache line, which queues up an invalidate of instruction cache (5) `dsb ish` again to wait for the invalidate to finish (6) `isb` (7) branch to code. Check that you have all those steps. – Nate Eldredge Feb 06 '23 at 06:37
  • Apple's code seems to skip the `dc cvau`. The "we are fully coherent" comment suggests that maybe their chips provide stronger cache coherence than the ARMv8 spec requires. But there's still a `dsb ish` after the writes are all done and before the `ic ivau` loop begins, which you do not mention having in your version. – Nate Eldredge Feb 06 '23 at 06:43

1 Answer


Sorry, these sound like rather loosely formulated questions to me, and I'm not sure I understood them correctly, but I'll try to answer.

  1. Does ic ivau, xN "block" the M1, i.e. prevents other instructions from executing in parallel, or perhaps flushes the pipeline?

'Parallel' is ambiguous in this context. Do you mean blocking another CPU, or other instructions on the same CPU?

ARMv8 covers both in-order cores (e.g. the Cortex-A53) and out-of-order cores (e.g. the Cortex-A57). In both cases the CPU has an internal multi-stage pipeline. A pipeline, in this context, means that several instructions are executed 'in parallel', i.e. at the same time; more precisely, execution of inst2 can start before inst1 has completed (though I'm not sure this is the kind of 'parallel' the question means).

There is also a difference between issuing an instruction and completing it. For example, issuing ic ivau starts the cache invalidation, but it does not block the pipeline until it completes. Waiting for completion is arranged with the dsb and isb barrier instructions.

"Ordering and completion of data and instruction cache instructions" section in ARMv8 reference manual describes all cache related ordering in details.

So, considering all of the above, the answers to the original questions:

  • prevents other instructions from executing in parallel
    No

  • or perhaps flushes the pipeline
    No

Disclaimer: all of the above holds for a generic ARMv8 CPU (also sometimes referred to as 'arm64'). The M1 might have its own hardware bugs or implementation specifics that affect execution.

  2. Does ic ivau, xN really not work if the page is writeable?

No — the ic instruction itself is not affected by memory page attributes.

  3. Is there some other approach to cache line invalidation to reduce this overhead?

If the memory block to invalidate is big enough, it might be faster to blow away the whole cache at once instead of looping over the memory region.
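At the instruction level that would look like the snippet below. Note that ic iallu is only available at EL1 (kernel or bare-metal code), so this is illustrative rather than something a macOS user-space process can run directly:

```c
static inline void icache_invalidate_all(void) {
    __asm__ volatile("ic iallu\n\t"  /* invalidate entire I-cache to PoU */
                     "dsb ish\n\t"   /* wait for the invalidate to complete */
                     "isb"           /* resynchronize instruction fetch */
                     ::: "memory");
}
```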


Side note:

However, the code of course doesn't work with sys_icache_invalidate() commented out. Once I add it back and benchmark, it's adding almost 3 µs to the execution time

Why does that surprise you? A cache is about repeated access. The 3 µs comes from the need to access a higher, slower memory level. After the first execution, once the instructions/data have been fetched back into the cache, performance will return to normal.

Also, you mention invalidating the instruction cache without mentioning flushing the data cache anywhere.

Any self-modifying-code / code-loading sequence is (a code sketch follows the list):

  • write code to memory
  • flush data cache
  • invalidate instruction cache
  • jump to execute new code
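Sketched as C with inline assembly below. The 64-byte cache-line size is an assumption; real code should derive it from CTR_EL0, or just call a platform helper such as sys_icache_invalidate():

```c
#include <stddef.h>

static void sync_generated_code(char *start, size_t len) {
    const size_t line = 64;                       /* assumed cache-line size */
    /* flush (clean) the D-cache by VA to the point of unification */
    for (char *p = start; p < start + len; p += line)
        __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");     /* wait for the cleans */
    /* invalidate the I-cache by VA */
    for (char *p = start; p < start + len; p += line)
        __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb ish" ::: "memory");     /* wait for the invalidates */
    __asm__ volatile("isb" ::: "memory");         /* resynchronize fetch */
    /* now it is safe to jump to the new code */
}
```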
user3124812
  • `Is it blocking other CPU ? or other instruction on same CPU ?` my code is single-threaded, so my worry is about superscalar execution, i.e. instruction-level parallelism. What I'm seeing is a significant increase in execution time when `ic` instructions are spread throughout the algorithm, compared to having no `ic` instructions at all. The execution time is very similar to running them all at once at the end of the algorithm, by calling `sys_icache_invalidate` at the end. This suggests to me that `ic` is blocking further instructions from executing until it's completed. – swineone Feb 05 '23 at 13:10
  • `And why that's surprise you ?` Note that 3 µs is almost 10,000 M1 cycles. My code calls `ic` 130 times, for consecutive cache lines. Even if a single execution unit of the M1 were able to execute `ic` (but in a pipelined fashion, so 1 every cycle), I would expect at most a few hundred cycles for all `ic`s to execute. `dsb` and `isb` should also add at most a few hundred cycles. When running the updated code, it should be fetched from L2, so it shouldn't add considerable latency to its execution. Overall it feels to me this is at least 10x slower than it should be. – swineone Feb 05 '23 at 13:21
  • Oh, and something I forgot to mention: the code that is generated (and goes through cache invalidation, etc.) is run only once. Thus it makes no difference to me whether a second, third, etc. execution of it would be fast, as I run it only once. Sure might seem crazy to generate code that is run only once, but the point is: not accounting for the slowdown due to cache invalidation, this approach is faster than the best known algorithm to solve the problem I'm working on. – swineone Feb 05 '23 at 13:26
  • Alright, that's a bit clearer, I think. Are you doing `ic ivau, xN` with the same register, like using only the `x0` register in a loop (e.g. `ic ivau, x0`) for invalidating? Using the same register might prevent reordering due to a same-register dependency. You might try to partially unroll the loop and use different regs. Something like `for (...) { x0 = addr, x1 = addr + 64, x2 = addr + 128, x3 = addr + 192; ic x0, ic x1, ic x2, ic x3; addr += 256 }` – user3124812 Feb 06 '23 at 00:30
  • I just tried a microbenchmark with unroll factors 4, 8 and 16, clearing a fixed 8 KB of the L1 I-cache in all cases. Running this routine just once leads to immense variability, so I have an outer loop that runs the routine a few hundred times. The resulting cycle count is divided by the number of iterations of this outer loop. Although there are still some outliers, in about 90% of the runs it takes 846-847 cycles, regardless of the unroll factor. So I don't think using the same register is the issue. – swineone Feb 06 '23 at 01:18
  • Wait, I just read your comments again. Are you calling `sys_icache_invalidate()` or directly executing the `ic ivau` instruction? – user3124812 Feb 06 '23 at 08:38