2

I am trying to understand what happens physically on the data bus when an STM32H7 (Cortex-M7) executes an LDRB instruction (assuming the caches are disabled, to simplify). Is there a 32-bit access to the memory with 3 out of 4 bytes discarded? Does it depend on the type of memory? If the code does four LDRB instructions on consecutive addresses, how does that compare (in number of cycles) to a single 32-bit LDR?

Guillaume Petitjean
  • 2,408
  • 1
  • 21
  • 47
  • STM32 covers a number of different ARM microarchitectures. Also, I think that ARM doesn't publish cycle timing tables for at least some of the microarchs. – Thomas Jager Jul 02 '19 at 13:24
  • I will specify the version. – Guillaume Petitjean Jul 02 '19 at 13:25
  • 1
    Read the ARM documentation on the AMBA/AXI/AHB buses. It would be wasteful to do a single byte write, so a full bus-width access is normally performed. 50% of your answer is specific to the chip vendor, since they interface to the ARM bus(es). Since you cannot tell the size of a read other than in units of the bus width, they wouldn't be able to make it a partial bus transfer. There have been exceptions, but if performance and cost are factors then the whole system is going to be at least 32 bits wide if not 64; otherwise you are wasting cycles and dollars. – old_timer Jul 02 '19 at 16:56
  • 1
    Related (the cache-enabled case): [Are there any modern CPUs where a cached byte store is actually slower than a word store?](//stackoverflow.com/q/54217528) - old_timer's microbenchmark shows that *cached* byte and word loads on Cortex-M7 are the same speed. Only byte / halfword *stores* are slower. 4 consecutive STRB might possibly be efficient, if that CPU has a store buffer. But it might be too simple a pipeline for that. Higher-end CPUs with a store buffer can merge byte stores before they commit to cache. (But loads don't sit around in a buffer unless they miss; each one just happens.) – Peter Cordes Jul 02 '19 at 21:55
  • @PeterCordes, thank you, quite interesting. What I understand is that, with the cache enabled, doing N word loads in a row takes the same time as doing N byte loads. So in the end, if you need to read N bytes, it would be 4 times faster to use word loads (counting only load instructions). Is that correct? – Guillaume Petitjean Jul 03 '19 at 07:35
  • Yes, assuming cache hits of course. That should be true in the uncacheable case as well. – Peter Cordes Jul 03 '19 at 07:38

1 Answer

2

Cortex-M7 has a 64-bit AMBA4 AXI interface.

This is only part of the answer, since this data bus connects to one of the STM32H7's memories somewhere, but we can assume that memory has an interface at least as wide as the bus. The memory controller will most likely read the full width from the memory (though maybe not at the core frequency).

The read data will be returned on the bus, occupying the read channel for however many cycles the handshake takes. For a byte read, the data returned should be a byte.

Performing four byte reads could avoid the external memory access, but keeps the bus busy for 4 transfers. The bus can support multiple outstanding transfers (limited by the chip design, not the processor). Architecturally, the processor is permitted to merge the transfers (but this would naturally be done by the cache, which you have disabled).

To a first-order approximation, you can load eight 32-bit registers in the same number of cycles as performing four byte reads, since the AXI bus is 64 bits wide. In fact it can be faster, because you can use a single LDM instruction rather than 4 LDRB instructions, and instruction fetches share the same bus.

It should be noted that stores are potentially more complex because it is harder to build the logic to ignore partial write data, and fairly easy to merge writes.

(This is a 'generic' answer rather than a reflection of the M7 micro-architecture; you would need to do your own benchmarking to understand the detailed implications for your question.)

Sean Houlihane
  • 1,698
  • 16
  • 22