Suggestions on solving locking problem with ARM 7 CCL Lisp compiler, raspberry Pi?

Question

Background: CCL aka OpenMCL is a very nice, venerable, lightweight but fairly fast Common Lisp compiler. It's an excellent match for the RPi because it runs on 32-bit models, and isn't too memory intensive. In theory, unlike the heavier SBCL lisp compiler, it supports threads on 32-bit RPi. But it has a long-standing mutex bug.

However, this is an ARM machine language question, not a Lisp question. I'm hoping an ARM expert will read this and have an Aha! moment from a different context.

The problem is that CCL suffers a fatal flaw on the Raspberry Pi 2 and 3 (and probably 4), as well as other ARM boards: when using threading and locking it will fail with memory corruption. The threading failure has been known for years.

I believe I isolated the issue further, to a locking failure: when a CCL thread grabs a lock (mutex) and checks to see if it owns the lock, it sometimes turns out that another thread owns the lock. Threads seem to steal each other locks, which would be fatal for garbage collection, among other things. It seems to me that information that one core has taken control of a lock does not percolate through to the other cores before the other cores grab it themselves (race condition). This bug does not happen on one-core RPis, like the Pi Zero.

I've explored this bug in this GitHub repo. The relevant function is (threadtest2) which spawns threads, performs locks, and checks lock ownership. I initially thought that the locking code might be a missing DMB instruction; DMB "ensures that the exclusive write is synchronized to all processors". Thus I put DMB instructions all over the locking code (but upon looking carefully, DMB was already there in a few spots, so the original compiler author had thought of this).

In detail, I put DMBs into just about every locking routine of arm-misc.lisp called from the futex-free version of %get-spin-lock called by %lock-recursive-lock-ptr in ARM/l0-misc.lisp, with no luck. The low-level function in ARM/l0-misc.lisp is `%ptr-store-fixnum-conditional'. This doesn't use DMB, but uses LDREX/STREX atomic update functions.

[edit] As user coredump points out below, DMB is indeed necessary on multi-cores according to blogs and ARM docs, though there is some disagreement concerning how many places it should appear, after the STREX or also before the LDREX.

Obviously, I'm not asking anyone to diagnose this compiler. My question is

Does this lock-stealing behavior ring a bell? Has anyone else seen this problem of lock-stealing or race-condition on the ARM in another context, and have they found a solution? Is there something I'm missing about DMB, or is there another instruction needed?

As an addendum, here is my annotation of the part where it might be failing, in ARM/lo-misc.lisp, function %ptr-store-fixnum-conditional - this is machine code in Lisp format. I inserted some DMBs as shown following the comment below, and it didn't help.

    ;; this is the function used to grab the mutex, using ldrex/strex
    ;; to set a memory location atomically
    (defarmlapfunction %ptr-store-fixnum-conditional 
       ((ptr arg_x) (expected-oldval arg_y) (newval arg_z))
     (let ((address imm2)  ;; define some variables
           (actual-oldval imm1))
           (macptr-ptr address ptr)
           @again
      (DMB) ;; my new DMB, according to Chen's blog (not ARM manual)
          ;; first, load word from memory with ldrex, 
          ;;    initializing atomic memory operation
          ;;    and claiming control of this memory for this core
          (ldrex actual-oldval (:@ address))
          ;; if the actual-oldval is wrong, then give up on 
          ;;  this pointer store because the lock is taken,
          ;;    (looping higher up in code until free)
          (cmp actual-oldval expected-oldval)
          (bne @done)
          ;; 
          ;; 2nd part of exclusive memory access:
          ;;  store newval into memory and put a flag into imm0
          (strex imm0 newval (:@ address))
          ;; if the exclusive store failed, another core messed 
          ;;    with memory, so loop for another  ldrex/strex cycle 
          (cmp imm0 (:$ 0))
          (bne @again)
     (DMB) ;; my new DMB after conditional jump
          ;; success: the lock was obtained (and exclusive access
          ;;    was cleared by strex)
          (mov arg_z actual-oldval)
          (bx lr)  ;; return to caller in case of good mutex grab
          @done
          ;; clear exclusive access if lock grab failed
          (clrex)
          (mov arg_z actual-oldval)
      (DMB) ;; my new DMB.  Maybe not needed?
          (bx lr)))  ;; return to caller in case of failed mutex grab

Addendum - Once again, I tried put DMB around every LDREX/STREX, and it didn't help. I also tried putting a DMB into every %SET-xxx function, following the ARM docs on releasing mutexes, but this was harder to trace - I couldn't find out where %%set-unsigned-long was defined, after grepping the whole source tree, so I blindly stuffed DMB before every STR instruction inside a %SET-xxx function.

I believe that CCL uses system level futex on other platforms, and does its own custom locking only (?) on the ARM, if that's another clue. Maybe the whole thing could be fixed using the OS supplied futex? Maybe no other system uses custom locks, so the ARM is just the first (multi-core) system to show breakage?

I don't know, but Raymond Chen here (https://devblogs.microsoft.com/oldnewthing/20210614-00/?p=105307) seems to agree that there must be DMB instructions to ensure previous writes are finished; hope you can find out — coredump, Sep 20 '22 at 21:10
Thank you. This inspired me to look further. [This blog](http://zhiyisun.github.io/2016/07/06/Builtin-Atomic-Operation-in-GCC-on-Different-ARM-Arch-Platform.html) seems to use DMB only after STREX. This [stackoverflow](https://stackoverflow.com/questions/54061127/dmb-instructions-in-an-interrupt-safe-fifo) has the DMB in two spots, like Chen. [This ARM page](https://developer.arm.com/documentation/dht0008/a/arm-synchronization-primitives/practical-uses/implementing-a-mutex?lang=en) has the DMB just after STREX. I'll explore this, but I'm sure I tried DMB just about everywhere. — J Klein, Sep 20 '22 at 21:41
Update: I added a DMB to the start and end (following Chen) of all LDREX/STREX routines, and no luck. Sometimes my test code shows a mutex is not owned by its grabber, and sometimes it runs for hours, but then I see it is apparently stuck in one thread (mutex not released?). Will work on it more when I have a chance. Perhaps a DMB might also be needed when the mutex is released, not just when grabbed. — J Klein, Sep 24 '22 at 09:29
I will follow up and try to find more information on the subject. Good luck — Tibic4, Sep 28 '22 at 00:57
I hope this still gets views and ideas, don't hesitate to add more information if you can narrow it down even further. — coredump, Sep 28 '22 at 22:06
Thank you for taking an interest, and for offering a bounty. I was worried I might be the only one who cared. I'm starting to suspect it might not be in the spinlock, but higher up, so when I have the chance I might write some pure-spinlock code. And shut off GC while running it. Another sad thing is that CCL isn't being developed for 64 bit ARM, so all these efforts might be a dead end. — J Klein, Oct 01 '22 at 07:35

score 1 · Answer 1 · answered Sep 27 '22 at 13:27

1

You can try, to see if it helps, to add a DMB instruction before the LDREX instruction and after the STREX instruction. The DMB instruction is a memory barrier instruction, which ensures that the exclusive write is synchronized to all processors. The DMB instruction is described in the ARM Architecture Reference Manual.

answered Sep 27 '22 at 13:27

Tibic4

3,709
1
13

Yes, that's what I did in the addendum to my post. It didn't help. Next on the agenda, when I have a chance, is to strip the locking code down the absolute minimum (just the spinlock without the additional layer of the CCL higher level lock) to verify that the issue really happens in the spinlocking part appearing in the original post). – J Klein Sep 27 '22 at 23:50
I deleted it because I think my answer didn't help. – Tibic4 Sep 28 '22 at 21:48
I awarded the bounty here because 1. otherwise it just vanishes and 2. you tried to answer and don't have much reputation yet. – coredump Sep 28 '22 at 22:02

Suggestions on solving locking problem with ARM 7 CCL Lisp compiler, raspberry Pi?

1 Answers1