1

I learn 'Computer Organization and Design' RISC-V version by David A. Patterson, and on page 254 Elaboration have below code

below is book contents and related code:

While the code above implemented an atomic exchange, the following code would more efficiently acquire a lock at the location in register x20, where the value of 0 means the lock was free and 1 to mean lock was acquired:

addi x12, x0, 1
 // copy locked value
again: lr.d x10, (x20)
 // load-reserved to read lock
bne x10, x0, again
 // check if it is 0 yet
sc.d x11, x12, (x20)
 // attempt to store new value
bne x11, x0, again
 // branch if store fails

which is changed from (based on) original after adding lock

Since the load-reserved returns the initial value, and the store-conditional returns 0 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of x20:

again:lr.d x10, (x20)
 // load-reserved
sc.d x11, x23, (x20)
 // store-conditional
bne x11, x0, again
 // branch if store fails
addi x23, x10, 0
 // put loaded value in x23

1- the book says addition of lock to the code by addi x12, x0, 1 // copy locked value is 'more efficient' which I don't get where it is

2- I think this lock can't avoid 'spuriously fail' based on 'cache line' hardware design, am I right?

zg c
  • 113
  • 1
  • 1
  • 7
  • You changed it from a compare-exchange to a simple exchange that only fails on contention, not checking the old value. Those are different operations, like C++ `.compare_exchange_strong` (since it has a retry loop) vs. `.exchange`. Either one can work for a spinlock, but storing a `1` when the value's already there will dirty the cache line when you didn't need to. Is this trying to take a spinlock? It actually not just a CAS_strong, it's retrying on mismatch, too, so it's like `do{}while(! lock.cas_weak(0, 1));` or something? – Peter Cordes May 31 '23 at 14:03
  • 2
    2. Correct, you can't avoid spurious failure with LL/SC instructions. – Peter Cordes May 31 '23 at 14:08
  • 1
    What do you mean "addition of lock"? Adding what to what? Are these loops trying to take a spinlock? The second one doesn't retry if the lock was already held by another thread (`x10 != 0`) so they're not equivalent, IDK what you're asking. – Peter Cordes May 31 '23 at 14:10
  • 1- Answering to comment 3, I have improved questions. 1.1- Although the book doesn't use term 'spinlock' , based on the loop implied by 'bne', I think it is spinlock in both conditions above. 1.2- I want to ask why first code 'more efficient' than second. And you answered it that first will first check lock in one loop, but it may ensure safety, is it always 'more efficient'? – zg c Jun 01 '23 at 12:24
  • In your version, what does `x23` hold on input? You're doing `x10 = x20->exchange(x23)` ; `x23 = x10`. The other code is doing retrying `x20->compare_exchange_weak(0, 1)` until it succeeds, with no input variables except the pointer. `addi x12, x0, 1` just materializes a constant in a register so it can store it. – Peter Cordes Jun 01 '23 at 12:33
  • 2- answering comment 1, I agreed with you. The lock ensure the exclusive ownership of shared value in some way, so just store without checking. 2.1- Since 'will dirty the cache line', does it mean the first code isn't 'more efficient' in some way? 2.2- does 'storing a 1' refers to `addi x12, x0, 1`? and 'mismatch' means that lock is as expected as 0? – zg c Jun 01 '23 at 12:34
  • I was still editing my last comment when you posted your. Have another look. You didn't answer my question about `x23` in yours, and I added a sentence at the end about putting the `1` in a register. – Peter Cordes Jun 01 '23 at 12:37
  • 3- Answering the fourth and sixth comment. Sorry for not offering the book context. 3.1- The first code use `x20` to store lock, the second code use `x20` as memory data to be exchanged with `x23`, which is weird at first glance. The book just writes as posted. So they may not doing the same thing at all. 3.2- What I want to know is how lock with `lr/sc` is 'more efficient' than `lr/sc` alone. Not caring too much that the book didn't improve code by adding code to original function code, instead it seems to rewrite one new function code. Sorry for making such a poor clarification. – zg c Jun 01 '23 at 12:52
  • 4- One small question to comment 1, since `bne x11, x0, again` is in first code, why here is `lock.cas_weak(0, 1)` instead of `lock.cas_strong(0, 1)` which will keep checking lock just as `bne` does. I have seen your [another answer](https://stackoverflow.com/questions/72766332/c11-how-to-produce-spurious-failures-upon-compare-exchange-weak) where you give assembly in arm. I saw `cbnz w4, .L5` in your Godbolt link whose function is similar to `bne x11, x0, again`. Although program will exit if load fails in `cas_strong` , above will not because of `bne x10, x0, again` spinlock – zg c Jun 01 '23 at 13:03
  • 1
    It's an optimized `do{}while(! x20->cas_weak(0, 1));`. It retries the same way whether failure was real (compare) or spurious (SC failure). With `expected` fixed at `0`, I guess it wouldn't matter if you described it as `cas_strong` either, though. To understand the point the book is making, though, you really have to look at the actual RISC-V operations it's doing, not in terms of C++ cas_weak or cas_strong. That's only helpful in terms of describing the algorithm, not efficiency considerations. – Peter Cordes Jun 01 '23 at 13:29
  • 1
    The book is making a point about using this loop to try to acquire a lock, instead of just putting a loop around an LL/SC `exchange`. Storing a `1` when the value in memory is already a `1` isn't useful, and just causes extra cache-coherency traffic. At least I think that's the point they're making. The two blocks of code *don't* do the same thing. – Peter Cordes Jun 01 '23 at 13:34
  • I understand what you say in the above two comments, but could you help me with why the book author says lock is 'more efficient' than lr/sc, which is my original puzzlement? Thanks for answering the above long comments. – zg c Jun 01 '23 at 16:46
  • 2
    I think they mean fewer instructions than `do{}while(x20->exchange(1) == 0)` which is the obvious way to use their exchange function to take a spinlock. Possibly also a benefit to not storing at all when the load sees a non-zero value. (So this core doesn't take exclusive ownership of the cache line when it can't do anything useful with it until after another core has stored a `0` to it.) But I'm not sure if `lr.d` along would try to get exclusive ownership (send an RFO = read for ownership) in anticipation of an SC. It at least doesn't dirty the cache line so doesn't have to write it back. – Peter Cordes Jun 01 '23 at 16:49
  • Answering to second to last comment and the last comment, I think I misunderstood what the book author want to say because the book got high praise from many readers without saying obscurity of the book. – zg c Jun 02 '23 at 03:01

1 Answers1

2

I think the authors mean fewer instructions than do{}while(x20->exchange(1) == 0), which is the obvious way to use their exchange function to take a spinlock. (Which is what their loop does. In C++ terms its somewhat like do{}while(! lock.cas_weak(0, 1));, but the points they're making about asm efficiency are specific to the LL/SC)

Possibly also a benefit to not storing at all when the load sees a non-zero value. (So this core doesn't take exclusive ownership of the cache line when it can't do anything useful with it until after another core has stored a 0 to it.) But I'm not sure if lr.d along would try to get exclusive ownership (send an RFO = read for ownership) in anticipation of an SC. It at least doesn't dirty the cache line if the compare fails, so it doesn't have to write it back.

That may also reduce livelock problems with multiple threads waiting for the lock, all running this loop.

For some related discussion on having multiple spinning, and read-only vs. atomic RMW accesses see Locks around memory manipulation via inline assembly (x86 which only has single-instruction CAS_strong, not LL/SC).


  1. Correct, you can't avoid spurious failure with LL/SC instructions.
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847