Following up on an earlier question of mine. I'm writing, testing and benchmarking code on a MacBook Air with the M1 CPU running macOS 13.2.
I implemented the code generation approach I suggested in my question and got all tests working, compared to a "conventional" (no code generation) approach to the same problem. As usual, I had to enable writes to executable pages using `pthread_jit_write_protect_np(0)` prior to generating the code, followed by write-protecting the pages again using `pthread_jit_write_protect_np(1)`, and then call `sys_icache_invalidate()` prior to running the generated code, due to cache coherency issues between the L1 I- and D-caches.
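For concreteness, the sequence described above can be sketched roughly as follows. This is only a minimal sketch: `publish_code` is a hypothetical helper name, `dst` is assumed to point into a region that was already mapped for JIT use, and the non-Apple branch (using the compiler builtin `__builtin___clear_cache`) is an assumption added so the snippet compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__APPLE__) && defined(__aarch64__)
#include <pthread.h>                 /* pthread_jit_write_protect_np */
#include <libkern/OSCacheControl.h>  /* sys_icache_invalidate */
#endif

/* Hypothetical helper: copy `len` bytes of freshly generated machine code
 * from `src` into `dst`, then make it safe to execute.
 * Assumption: on Apple silicon, `dst` lies in a MAP_JIT mapping. */
void publish_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if defined(__APPLE__) && defined(__aarch64__)
    pthread_jit_write_protect_np(0);  /* allow this thread to write JIT pages */
    memcpy(dst, src, len);            /* "generate" the code */
    pthread_jit_write_protect_np(1);  /* write-protect the JIT pages again */
    sys_icache_invalidate(dst, len);  /* drop stale instruction cache lines */
#else
    /* Portable fallback (assumption, not part of the setup described above). */
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
#endif
}
```

The call to `sys_icache_invalidate()` at the end is the step whose cost is measured below.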
If I run the full code with the call to `sys_icache_invalidate()` commented out, it takes a few hundred nanoseconds, which is quite competitive with the conventional approach. This is after a few straightforward optimizations, and after working on it more, I am certain I'd be able to beat the conventional approach.
However, the code of course doesn't work with `sys_icache_invalidate()` commented out. Once I add it back and benchmark, it adds almost 3 µs to the execution time. This makes the codegen approach hopelessly slower than the conventional approach.
Looking at Apple's code for `sys_icache_invalidate()`, it seems simple enough: for each cache line with the starting address in a register `xN`, it runs `ic ivau, xN`. Afterwards, it runs `dsb ish` and `isb`. It occurred to me that I could run `ic ivau, xN` after each cache line is generated in my codegen function, and then `dsb ish` and `isb` at the end. My thought was that perhaps each `ic ivau, xN` instruction could run in parallel with the rest of the codegen.
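A rough sketch of that interleaved idea, for illustration only. The function names are hypothetical, the cache line size is passed in as a parameter because I don't want to hard-code an assumption about it, and the inline assembly is only emitted when compiling for AArch64 (elsewhere the functions just do the address arithmetic).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines overlapped by [start, start + len).
 * `line` is the assumed cache line size and must be a power of two. */
size_t count_cache_lines(uintptr_t start, size_t len, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    if (len == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(line - 1);
    uintptr_t first = start & mask;              /* first line touched */
    uintptr_t last  = (start + len - 1) & mask;  /* last line touched */
    return (size_t)((last - first) / line) + 1;
}

/* Issue `ic ivau` for every line in the range, without any barrier.
 * Intended to be called from the code generator after each chunk is
 * written. Returns the number of lines visited. */
size_t invalidate_lines_no_barrier(const void *start, size_t len, size_t line)
{
    size_t n = count_cache_lines((uintptr_t)start, len, line);
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(line - 1);
    for (size_t i = 0; i < n; i++, addr += line) {
#if defined(__aarch64__)
        __asm__ volatile("ic ivau, %0" : : "r"(addr) : "memory");
#endif
    }
    return n;
}

/* Barriers issued once, after all code has been generated. */
void finish_invalidation(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dsb ish\n\tisb" : : : "memory");
#endif
}
```

The generator would call `invalidate_lines_no_barrier()` after writing each chunk and `finish_invalidation()` once before jumping to the generated code.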
Unfortunately, the code still failed, and moreover, it only shaved a couple hundred ns off the execution time. I then decided to add a call to `pthread_jit_write_protect_np(1)` before each `ic ivau, xN`, followed by a call to `pthread_jit_write_protect_np(0)`, which finally fixed the code. At this point, it added a further 5 µs to the execution time, which renders the approach completely infeasible. Scratch that, I made a mistake: even with the calls to `pthread_jit_write_protect_np()`, I simply can't get it to work unless I call Apple's `sys_icache_invalidate()`.
While I've made peace with the fact that I will need to abandon the codegen approach, I just wanted to be sure:
- Does `ic ivau, xN` "block" the M1, i.e. prevent other instructions from executing in parallel, or perhaps flush the pipeline?
- Does `ic ivau, xN` really not work if the page is writeable? Or perhaps `pthread_jit_write_protect_np()` is doing some other black magic under the hood, unrelated to its main task of write-protecting the page, that I could also do without actually write-protecting the page? For reference, here is the source to Apple's pthread library, but it essentially calls `os_thread_self_restrict_rwx_to_rx()` or `os_thread_self_restrict_rwx_to_rw()`, which I assume are Apple-internal functions whose source I was unable to locate.
- Is there some other approach to cache line invalidation to reduce this overhead?