Cortex M4 LDR/STR timing

Question

I am reading through the Cortex-M4 TRM to understand instruction execution cycles. However, some of the descriptions there are confusing:

1. In the table of processor instructions, STR takes 2 cycles.

2. Later, in Load/store timings, it indicates that:

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing.

If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete.

If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.

3. Still in Load/store timings, it indicates that LDR can be pipelined with a following LDR or STR, but STR can't be pipelined with following instructions:

Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.

More specifically, here is what confuses me:

Q1. Statements 1 and 2 seem to conflict with each other. How many cycles does STR actually take, 1 or 2? (My experiment shows 1, though.)

Q2. Statement 2 indicates that if a store goes through the write buffer and the buffer is not available, it stalls the pipeline regardless, but if the store bypasses the buffer, the pipeline only stalls when a load/store instruction follows. It sounds like the write buffer can only make things worse, which is contrary to common sense.

Q3. Statement 3 means STR can't be pipelined with the following instruction, whereas statement 2 means STR is always pipelined with the following instruction under the right conditions. How should I understand these conflicting statements? (And here it says STR takes 2 cycles instead of 1 because of the write buffer.)

Q4. I can't find more information on how the write buffer is implemented. How large is the buffer? How does STR determine whether to use it or bypass it?

its pipelined so an instruction can finish before the write finishes. the size of the write buffer is in the arm documentation. trying to count instruction cycles by looking at instructions generally fails on a pipelined architecture. — old_timer, Aug 05 '20 at 14:06
Therefore, with newer ARM processors, a distinction is made between latency (when an instruction is completely processed) and throughput (when can the next similar but independent instruction be started). — fcdt, Aug 05 '20 at 23:00

EmbeddedSoftwareEngineer · Answer 1 · 2020-09-10T17:10:04.667

Type of STR

Note that on the "Load/store timings" page, the first statement refers to STR with an immediate (literal) offset from the base address register (STR Rx,[Ry,#imm]). Further down, it refers to an STR with a register offset from the base address register (STR R1,[R3,R2]). These are two different variants of the STR instruction.

Literal offset STR (STR Rx,[Ry,#imm])

Hmm, I wonder if the documentation is misleading when it says "always one cycle", because it then goes on to add a caveat that means it could take multiple cycles: "... the next instruction is delayed until the store can complete".

I am going to do my best to interpret the documentation:

STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is not to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.

I would assume that the first STR takes 1 cycle, if the write buffer is available. If it is not available, the next instruction will be stalled until the buffer is available. However, if the buffer is not in use, it will delay the next instruction until the bus transaction completes.

With a non-consecutive STR (the first STR), the write buffer will be empty and the instruction takes 1 cycle. If there are 2 consecutive STR instructions, the 2nd STR will begin immediately, as the 1st STR has written to the buffer. However, if the bus transaction for the 1st STR stalls and remains in the write buffer, the 2nd STR will be unable to write to the buffer and will block further instructions. Then, when the bus transaction for the 1st STR completes, the buffer is emptied and the 2nd STR writes to the buffer, unblocking the next instruction.

A stalled bus transaction that is held in the write buffer doesn't affect non-STR instructions, as they do not need access to the write buffer to complete. So an STR instruction whose bus transaction is stalled will not delay further instructions unless the next one is another STR. However, if the write buffer is not in use, then a stalled bus transaction will delay all instructions.
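
The interpretation above can be turned into a toy timing model. The following C sketch is only an illustration under stated assumptions, not a cycle-accurate description of the Cortex-M4: it assumes a one-entry write buffer, one issue cycle per instruction, and that only stores wait on a full buffer (which is how this answer reads the TRM). The names op_t, wait_states and count_store_stalls are made up for this example.

```c
#include <stddef.h>

/* Hypothetical instruction record for the toy model. */
typedef struct {
    int is_store;          /* 1 = STR, 0 = any non-memory instruction */
    unsigned wait_states;  /* extra bus cycles the store's data phase needs */
} op_t;

/*
 * Count stall cycles caused by a full one-entry write buffer.
 * Assumptions (not taken from the TRM): every instruction issues in
 * 1 cycle, and a buffered store keeps the single entry busy for
 * 1 + wait_states cycles.
 */
unsigned count_store_stalls(const op_t *ops, size_t n)
{
    unsigned now = 0;             /* cycle in which the next instruction issues */
    unsigned buffer_free_at = 0;  /* first cycle in which the entry is empty */
    unsigned stalls = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i].is_store) {
            if (now < buffer_free_at) {
                /* The previous store has not drained yet: wait for the entry. */
                stalls += buffer_free_at - now;
                now = buffer_free_at;
            }
            buffer_free_at = now + 1 + ops[i].wait_states;
        }
        now += 1;  /* the instruction itself issues */
    }
    return stalls;
}
```

In this model, a store with 3 wait states followed immediately by another store costs 3 stall cycles, while the same two stores separated by three non-memory instructions cost none.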

It does seem a bit off that the instruction set summary page puts a solid "2" as the number of cycles for STR when clearly it is not as predictable as this.

Register offset STR (STR R1,[R3,R2])

I share your confusion over the following apparently conflicting statement:

Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.

This appears to be contradicted by the first clause on the page. However, I believe this is because it is referring to 2 different STR types: literal offset (the first one) and register offset, with the register offset STR being the one that can't have instructions pipelined after it. The language could be clearer though. What does it mean by a stalled STR? Is it referring to a register offset STR, which always stalls by default? Is this stall different from a stall caused by the write buffer being unavailable? It is easy to get lost here.

I think basically a register offset STR is a minimum of 2 cycles. It is going to block and take more cycles if the write buffer is unavailable, or if the transaction is not buffered and the bus stalls.

Size of write buffer

The size is a single entry; see https://developer.arm.com/documentation/100166/0001/Programmers-Model/Write-buffer?lang=en

To prevent bus wait cycles from stalling the processor during data stores, buffered stores to the DCode and System buses go through a one-entry write buffer. If the write buffer is full, subsequent accesses to the bus stall until the write buffer has drained.

The write buffer is only used if the bus waits the data phase of the buffered store, otherwise the transaction completes on the bus.

Usefulness of write buffer

As far as my understanding goes: if the CPU could write to a bus instantly, it would not need a buffer, as the bus would be free immediately for the next instruction. On a high-performance part like the M4, some of the memory buses can't keep up with the CPU clock rate, which means a transaction can take multiple cycles. There could also be DMA units that make use of the same bus. To avoid stalling the CPU until a bus transaction completes, the buffer provides storage that the CPU can write to immediately, and the hardware then writes its contents to the bus when the bus is free.

The term "buffer" is fully compatible with being a FIFO, even for software. It's not rare to implement a FIFO with a circular buffer. e.g. Unix pipe buffers are kernel data structures that buffer data in a fifo manner between write and read of a pipe. TCP send/receive buffers are obviously used in a FIFO manner, and are *very* similar to a computer-architecture store buffer. (See https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/) "store buffer" is the standard terminology for such a queue between store execution and commit to cache (or memory). — Peter Cordes, Sep 10 '20 at 16:15
And yes, a store buffer is very useful to decouple execution from memory store latency (and bandwidth if the buffer is larger). High-performance CPUs have much larger store buffers, but almost all pipelined CPUs have them, even simple in-order CPUs like cortex-m. [Size of store buffers on Intel hardware? What exactly is a store buffer?](https://stackoverflow.com/q/54876208) — Peter Cordes, Sep 10 '20 at 16:17
What I was trying to express was that typical use of terminology in hardware documentation uses the term "buffer" to mean a single-element store. Typically in hardware documentation if there are multiple elements buffered it will be given a specific description such as FIFO, LIFO or register bank for example. I was assuming the OP was coming from the perspective of software where we use the term buffer to refer to arrays for example. I agree in software a buffer could be a circular buffer, FIFO, LIFO — EmbeddedSoftwareEngineer, Sep 10 '20 at 16:18
I haven't read ARM manuals so I don't know if they'd avoid the term "buffer" for a multi-element store buffer, but "store buffer" is definitely the standard computer-architecture term for a multi-element write buffer. One entry is a "store buffer entry". In other contexts in computer architecture, like queue between pipeline stages, or a "reorder buffer" for out-of-order exec, the term "buffer" is standard for a multi-element structure. Unless it's an ARM thing, I don't agree that "buffer" normally means 1 entry in a HW context. — Peter Cordes, Sep 10 '20 at 16:28
@EmbeddedSoftwareEngineer, thanks for the reply. I'd like to post what I summarized from my experiment — Eric Sun, Sep 11 '20 at 03:26

score 2 · Answer 2 · answered Sep 11 '20 at 04:20

@EmbeddedSoftwareEngineer, thanks for the reply. I'd like to post what I summarized from my experiment

1. As a baseline, LDR takes 2 cycles and STR takes 1 cycle.
2. There are 2 kinds of dependency between adjacent instructions:
- Content dependency. A typical example is STR followed by LDR: because nothing in the assembly guarantees that the memory the LDR reads is not modified by the STR, the LDR is always delayed and takes 3 cycles.
- Addressing dependency. When the 2nd instruction's address is based on the result of the first instruction, the 2nd instruction is always delayed. Typical examples:
```
sub SP, SP, #20     ; SP is updated here...
ldr r1, [SP, #4]    ; ...and used as the base address right away
; OR
ldr r3, [SP, #8]    ; r3 is loaded here...
ldr r4, [r3]        ; ...and used as the base address right away
```
  The dependent LDR (the second instruction in each pair) always gets an extra wait cycle, which yields 3 cycles.
3. When neither of the dependencies described in 2 applies, an LDR following an LDR takes 1 cycle and an STR following an LDR takes 0 cycles.

All of this was measured on TCM, which introduces no extra cycles from cache loads or external bus stalls.
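
The rules above can be written down as a small estimator. This is only a sketch of my measured numbers, not something taken from the TRM. The names insn_t, insn_cycles and estimate_cycles are made up for illustration, and two things are assumptions: non-memory instructions are counted as 1 cycle, and a dependent STR is charged the baseline 1 cycle because I did not measure that case.

```c
#include <stddef.h>

typedef enum { OP_LDR, OP_STR, OP_OTHER } op_kind_t;

typedef struct {
    op_kind_t kind;
    int addr_depends_on_prev;  /* 1 if the address uses the previous result */
} insn_t;

/* Cycles for one instruction given its predecessor (prev may be NULL). */
static unsigned insn_cycles(const insn_t *prev, const insn_t *cur)
{
    switch (cur->kind) {
    case OP_LDR:
        if (prev && (prev->kind == OP_STR || cur->addr_depends_on_prev))
            return 3;   /* content or addressing dependency */
        if (prev && prev->kind == OP_LDR)
            return 1;   /* pipelined behind the previous LDR */
        return 2;       /* baseline */
    case OP_STR:
        if (prev && prev->kind == OP_LDR && !cur->addr_depends_on_prev)
            return 0;   /* pipelined behind the previous LDR */
        return 1;       /* baseline; dependent STR was not measured (placeholder) */
    default:
        return 1;       /* assumption: single-cycle non-memory instruction */
    }
}

/* Sum the per-instruction estimates over a straight-line sequence. */
unsigned estimate_cycles(const insn_t *seq, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += insn_cycles(i ? &seq[i - 1] : NULL, &seq[i]);
    return total;
}
```

For example, an independent LDR/LDR pair is estimated at 3 cycles and an LDR/STR pair at 2, while an STR followed by an LDR comes out at 4.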

I am using systick to measure a function implemented in assembly, first create a skeleton to get a time basis, then add instruction one by one to see the delta — Eric Sun, Sep 14 '20 at 13:00
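
The arithmetic behind that kind of measurement might look like the sketch below. It is a minimal example, not the exact code used: the helper names systick_elapsed and cycle_delta are hypothetical, and it assumes SysTick runs from the processor clock with a reload value of 0xFFFFFF and wraps at most once between the two reads.

```c
#include <stdint.h>

#define SYSTICK_MASK 0x00FFFFFFu  /* the SysTick counter is 24 bits wide */

/*
 * Ticks elapsed between two reads of a 24-bit down-counter.
 * Assumes the reload value is 0xFFFFFF and at most one wrap occurred.
 */
static uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    /* The counter counts down, so start - end is the elapsed count. */
    return (start - end) & SYSTICK_MASK;
}

/* Cycles attributable to the added instructions: measurement minus skeleton. */
static int32_t cycle_delta(uint32_t skeleton_ticks, uint32_t measured_ticks)
{
    return (int32_t)measured_ticks - (int32_t)skeleton_ticks;
}
```

On the target, start and end would be two reads of the SysTick current value register (SysTick->VAL in CMSIS), taken just before and just after the call to the function under test.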
