Type of STR
Note that on "Load/Store timings page" the first statement refers to STR with a literal offset to the base address register (STR Rx,[Ry,#imm]
). Further down it refers to an STR with a register offset to the base address register (STR R1,[R3,R2]
). These are two different variants of the STR instruction.
Literal Offset STR(STR Rx,[Ry,#imm]
)
Hmm, I wonder if the documentation is mis-leading when it says "always 1 cycle", because it then follows to add a caveat that means it could take multiple cycles "... the next instruction is delayed until the store can complete"
I am going to do my best to interpret the documentation:
STR Rx,[Ry,#imm] is always one cycle. This is because the address generation is performed in the initial cycle, and the data store is performed at the same time as the next instruction is executing. If the store is to the write buffer, and the write buffer is full or not enabled, the next instruction is delayed until the store can complete. If the store is to the write buffer, for example to the Code segment, and that transaction stalls, the impact on timing is only felt if another load or store operation is executed before completion.
I would assume that the first STR takes 1 cycle, if the write buffer is available. If it is not available, the next instruction will be stalled until the buffer is available. However, if the buffer is not in use, it will delay the next instruction until the bus transaction completes.
With a non consecutive STR (the first STR) the write buffer will be empty, and the instruction takes 1 cycle. If there are 2 consecutive STR instructions, the 2nd STR will begin immediately as the 1st STR has written to the buffer. However, if the bus transaction for the 1st STR stalls and remains in the write buffer, the 2nd STR will be unable to write to the buffer and will block further instructions. Then when the bus transaction for the 1st STR completes the buffer is emptied and the 2nd STR writes to the buffer, unblocking the next instruction.
A stalled bus transaction, where the transaction is buffered in the write buffer, doesn't affect non STR instructions as they do not need access to the write buffer to complete. So an STR instruction where the bus is stalled will not delay further instructions unless it is another STR. However, if the write buffer is not in use then a stalled bus transaction will delay all instructions.
It does seem a bit off that the instruction set summary page puts a solid "2" as the number of cycles for STR when clearly it is not as predictable as this.
Register offset STR(STR R1,[R3,R2]
)
I stand with you on your confusion over the following apparently conflicting statement:
Other instructions cannot be pipelined after STR with register offset. STR can only be pipelined when it follows an LDR, but nothing can be pipelined after the store. Even a stalled STR normally only takes two cycles, because of the write buffer.
As this is contradicted by the first clause on the page. But, I believe this is because it is refering to 2 different STR types, literal offset (the first one) and register offset. The register offset STR being the one that can't allow pipelined instructions afterwards. The language could be clearer though. What does it mean by a stalled STR, is it refering to a register offset STR which always stalls by default? Is this stall different to a stall caused by the write buffer being unavailable? It is easy to get lost here.
I think basically a register offset STR is a minimum of 2 cycles. It is going to block and take more cycles if the write buffer is unavailable, or if the transaction is not buffered and the bus stalls.
Size of write buffer
The size is a single entry, see https://developer.arm.com/documentation/100166/0001/Programmers-Model/Write-buffer?lang=en
To prevent bus wait cycles from stalling the processor during data stores, buffered stores to the DCode and System buses go through a one-entry write buffer. If the write buffer is full, subsequent accesses to the bus stall until the write buffer has drained.
The write buffer is only used if the bus waits the data phase of the buffered store, otherwise the transaction completes on the bus.
Usefulness of write buffer
As far as my understanding goes: If the CPU could write to a bus instantly then it would not need a buffer as the bus would be free immediately for the next instruction. On a high performance part like M4 some of the memory buses can't keep up with the CPU clock rate which means it could take multiple cycles to perform a transaction. Also there could be DMA units that make use of the same bus. To prevent stalling the CPU until a bus transaction completes, the buffer provides an immediate store to use which hardware then writes to the bus when it is free.