2

I know this might be a strange usage. I just want to know if I can use LDREX/STREX with SCU disabled.

I am using a dual-core Cortex-A9 SoC. The two cores are running in AMP mode: each core has its own OS. Although the memory controller is a shared resource, each core has its own memory space, and neither core can access the other's memory space. Because no cache coherency is required, the SCU isn't enabled. At the same time, I also have a shared memory region that both cores can access. The shared memory region is non-cached to avoid cache coherency issues.

I define a spin lock in this shared memory region. This spin lock is used to protect access to shared resources. Right now, the spin lock is implemented simply like this:

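#include <stdint.h>

void spin_lock(volatile uint32_t *lock)
{
    while (*lock)
        ;           /* wait until the lock reads as free */
    *lock = 1;      /* not atomic with the read above: both cores can get here */
}

void spin_unlock(volatile uint32_t *lock)
{
    *lock = 0;
}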

where lock is a variable in shared memory, so both cores can access it.

The problem with this implementation is that access to the lock is not exclusive: both cores can read the lock as free and then both set it. That's why I want to use LDREX/STREX to implement the spin lock. Please allow me to restate my question:

Can I use LDREX/STREX without SCU enabled?

Thank you!

artless noise
yelInv

3 Answers

2

So, the direct answer to your question is yes, it is possible, so long as something else out in the memory system implements an exclusive monitor for the shared memory region. If it does not, then your STREXs will always receive an OKAY response (rather than EXOKAY), observable as a failure in the result register.

However, why would you not enable the SCU? Clearly, what you are trying to do requires a coherent view of memory between the two operating systems for at least that region. And with PIPT data caches, you are not going to see any aliasing of cache lines depending on how they are mapped in each image.
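
A minimal sketch of how that failure mode could be probed from C, assuming a C11 toolchain whose `<stdatomic.h>` operations are lowered to LDREX/STREX on the Cortex-A9 (GCC and Clang typically do this for ARMv7-A). The names `spin_try_lock` and `lock_result_t` and the retry bound are hypothetical, not part of any vendor API. A weak compare-exchange is allowed to fail even when the lock reads as free, which on ARM usually corresponds to a STREX that did not succeed. If that happens on every attempt, the region probably has no working exclusive monitor. This is a heuristic under those assumptions, not a definitive test.

```c
#include <stdatomic.h>

/* Hypothetical result codes for this sketch. */
typedef enum { LOCK_ACQUIRED, LOCK_BUSY, LOCK_NO_MONITOR } lock_result_t;

/*
 * Try to take the lock at most max_attempts times.
 * Returns LOCK_NO_MONITOR when every attempt saw the lock as free but the
 * store still failed, which suggests that store-exclusive never succeeds.
 */
static lock_result_t spin_try_lock(atomic_uint *lock, unsigned max_attempts)
{
    unsigned exclusive_failures = 0;

    for (unsigned i = 0; i < max_attempts; i++) {
        unsigned expected = 0;  /* we want to change 0 (free) into 1 (held) */

        if (atomic_compare_exchange_weak_explicit(lock, &expected, 1u,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return LOCK_ACQUIRED;
        }
        if (expected == 0) {
            /* The lock looked free, yet the store did not go through. */
            exclusive_failures++;
        }
        /* Otherwise expected now holds 1: the other core owns the lock. */
    }

    if (max_attempts > 0 && exclusive_failures == max_attempts) {
        return LOCK_NO_MONITOR;
    }
    return LOCK_BUSY;
}

static void spin_unlock(atomic_uint *lock)
{
    /* Release ordering makes earlier writes visible before the lock is freed. */
    atomic_store_explicit(lock, 0u, memory_order_release);
}
```

Running `spin_try_lock` once at start-up, on a lock that is known to be free and with a generous retry bound, would be one way to check the shared region before relying on it. Reading the SoC documentation remains the authoritative check.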

unixsmurf
  • Thank you @unixsmurf. The only reason I hesitate to enable the SCU is performance. We have a very strict performance requirement and currently we are only on the edge of meeting it. We might optimize some code to improve performance, but enabling the SCU might have a fundamental effect on performance. Does anyone have a picture of how much the SCU affects system performance? The OS we are using is Nucleus 3.x. – yelInv Apr 27 '15 at 14:45
  • I don't think the SCU affects performance at all, except where its operation forces eviction/migration of cache lines. In your use case that would only be for the memory regions actually shared between the images, where it would still be more efficient than a software-based, uncached mechanism. Of course, measuring that this is actually the case would be good practice. – unixsmurf Apr 27 '15 at 15:05
1

Overall, the answer is no. There are two issues here:

1) You cannot use load/store exclusive on uncached memory. The exclusive operations operate only on "normal" idempotent memory.

2) The ARM manual doesn't specify how exclusive monitors work in conjunction with memory coherence, but any sane implementation is essentially going to put the monitor in the cache line acquisition mechanism. If you disabled cache line snooping, you have most likely rendered the monitors non-functional on your chip.

Variable Length Coder
  • Hmm, AFAICS there's nothing in the ARM ARM that says Normal non-cacheable memory won't work, only that Device and Strongly-ordered are a no-no. Provided the shareability attribute is correct, the exclusives will be handled by the global monitor in the memory system, not each core's local monitor. – Notlikethat Apr 22 '15 at 08:11
  • +1 (a while ago) for giving information that is not in the ARM ARM. While your answer is not 100% correct, it shows some critical thought and pragmatic advice, which I don't think is found elsewhere and should probably be emphasized for people on StackOverflow. – artless noise Apr 22 '15 at 16:53
  • You can use ldrex/strex on uncached memory. The AXI bus works out there and has the exclusive access signals. The problem is that this is chip-vendor-specific territory, so the answer is specific to each rev of each chip from each vendor. That is a larger set of maybes than for the ARM cores, where the answer is yes: if the cache is on, it works. – old_timer Apr 22 '15 at 23:24
1

Your only (poorly formed) question,

Can I use LDREX/STREX without SCU enabled?

In an ideal ARM universe, yes, it is possible. I.e., it is possible that somewhere, some day you might be able to do this. I think you mean,

Can I use LDREX/STREX without SCU enabled in my system?

Unfortunately, the ARM ARM is a bit of a political/bureaucratic document. You must take extreme care when reading "strongly advised", "UNPREDICTABLE", "UNKNOWN" and "can". All programmers would like ldrex/strex to apply to all memory. In fact, if the bus controller (typically an AXI NIC) implemented a monitor, then there would be no trouble supporting the much-loved swp instruction. There are various posts on StackOverflow where people want to replace swp with ldrex/strex.

After you read and re-read the double speak of the ARM ARM (it is written for the programmer, but also for the silicon implementer), it becomes pretty clear that the monitor logic is probably implemented in the cache. A cache controller must implement dirty line broadcasts. Dirty line broadcasts are very similar to a 'monitor', and your 'reservation granule' is most likely the size of a cache line (what a coincidence).

The ARM ARM is written as a generic document for people who may wish to implement a Cortex-A CPU. It is written so that their hands (creativity) are not tied to implementing the monitor within the cache.

So you need to read the specific documentation for your particular Cortex-A9 SoC. It will probably only support ldrex/strex with cached memory. In fact, it is advisable to issue a pld to ensure the memory is in cache before doing the ldrex, and this will mean you need to activate the SCU in your system. I guess you are concerned about some additional cycle(s) that the SCU will add to latency?

I think some of this information has confused many extremely intelligent people. Beware the difference between "possible" and "is". Every person on StackOverflow probably desires the case where the monitor is implemented in the bus controller (or core memory chip). However, for most real chips, this is not the case.

For certain, if you want to future proof your code/OS to port to newer or different Cortex-A CPUs, you should not make this assumption even if your chipset does support a 'global monitor' outside the cache sub-systems.
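
One way to keep the lock code itself free of such assumptions is to write it against C11 atomics and let the compiler pick the instructions. The following is a sketch only: `shared_spinlock_t` and the `shared_spin_*` functions are hypothetical names, and it assumes a C11 toolchain with `<stdatomic.h>`. For ARMv7-A, GCC and Clang typically lower these operations to an LDREX/STREX loop plus barriers, so the code still depends on a working exclusive monitor for the memory the lock lives in.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for this sketch; place it in the shared region. */
typedef struct {
    atomic_flag held;
} shared_spinlock_t;

/* Put the lock into the "free" state; call once before any other use. */
static void shared_spin_init(shared_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_relaxed);
}

/* Single attempt: returns true if the caller now owns the lock. */
static bool shared_spin_trylock(shared_spinlock_t *l)
{
    /* test_and_set returns the previous value, so false means it was free. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is acquired. */
static void shared_spin_lock(shared_spinlock_t *l)
{
    while (!shared_spin_trylock(l)) {
        /* busy-wait; a real system might add WFE or a back-off here */
    }
}

/* Release the lock; earlier writes become visible before it is freed. */
static void shared_spin_unlock(shared_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Because both operating system images must agree on the lock's address and layout, the type would have to be defined identically in both builds.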

artless noise
  • The signals to support monitors in the ACE/AXI protocol seem to be **ARLOCK**, **RACK**, **EXOKAY** and **OKAY**. The [ARM SOC bus](http://stackoverflow.com/questions/28068525/can-you-help-me-understand-peripherals-addressing-and-bus-architecture-in-arm-ba) can be complex, and you need support from all involved masters/slaves and possibly other flow-through items like a TZASC, etc., for the answer to be 'yes' (and for them to be hooked up correctly). While the documents on the protocol define the signals, even some IP from ARM intended for (and used in) some Cortex-A designs doesn't mention these signals. – artless noise Apr 22 '15 at 17:48
  • The silicon-side documents say you don't need to support EXOKAY for uniprocessor systems, but the software-side documents say don't use swp, use ldrex/strex instead. The ARM caches support ldrex/strex on a uni- or multiprocessor system, but it is chip-vendor dependent as to what happens if there is a cache miss, so the answer is "it depends". The ldrex/strex instructions are intended for multicore processors to share resources across cores, not for one core to do a lock for itself. – old_timer Apr 22 '15 at 23:23
  • Thank you @artlessnoise for the detailed explanation. Yes, you are right, I am not sure where the global monitor is implemented. I hesitated to enable the SCU just for a very small portion of coherent memory because of potential performance issues. I am not sure how much the SCU will affect performance, but we do have a very strict performance requirement. Is it possible to implement a spin lock without using LDREX/STREX? – yelInv Apr 27 '15 at 14:38
  • No, it is not possible (if LDREX/STREX don't work, then `swp` won't either). The point that I think UnixSmurf was making is that if you enable cache, you need the SCU. The SCU penalty is much less than always missing cache. If you need hard-RT, then lock something in the cache, etc. I think your reason for disabling the SCU seems misguided? ...but that does not answer the direct question. – artless noise Apr 27 '15 at 16:28