
My specific context is STM32 ARM M0, but the question is more general.

Does it take the same number of clocks to read or write the contents of a memory-mapped peripheral (a GPIO port, for instance, or a serial port buffer) as a location in physical RAM? Does this differ from architecture to architecture?

  • It varies by design, so it can differ between companies, families, or even individual chips. You can even see that in the STM32 parts, as they show you the clock tree and often dictate speed limits for the peripheral clocks. If the GPIO or UART or something has a clock speed limit slower than the SRAM clock speed limit, do you expect them to take the same time? – old_timer Aug 11 '19 at 02:48
  • What happened when you did a performance test? Were they the same speed? – old_timer Aug 11 '19 at 02:49
  • This has nothing to do with ARM nor the Cortex-M0. – old_timer Aug 11 '19 at 02:50
  • Thank you for your comments, @old_timer. I haven't done any performance tests myself. I expected to find this information in ST's documentation, perhaps the datasheet or the reference manual for the relevant microcontroller, but I saw nothing on this in TFM. – iter Aug 11 '19 at 19:26
  • What specific chip is this? The exact timing is pretty much never documented. What is documented is the clock tree, which alone often shows that some blocks are slower and some are the same. Even if the clock is the same, that doesn't dictate that the peripherals respond as fast as the SRAM, so you just have to try it. You may find some chips where the performance is on par; I would generally expect not... but you just have to test it. – old_timer Aug 11 '19 at 19:37
  • The specific chip is stm32f042 – iter Aug 12 '19 at 20:14
  • I didn't look too long, and there are many STM32F042 parts; I only looked at one of them. The clock tree does not force nor limit PCLK. If you look at other parts from ST or others, there are speed limits on the peripherals ensuring that system RAM can be faster than the peripherals, other factors excluded. Even the same clock speed doesn't indicate that a particular peripheral (nor the SRAM) responds in a fixed number of clocks, or that all targets respond in the same number of clocks as each other. So the clock tree in this case doesn't automatically indicate a performance difference. – old_timer Aug 13 '19 at 04:47

3 Answers


Almost always yes, it differs. The AHB or AXI buses are much faster than the APB buses. Not only is the APB clock slower, the bus is also narrower. It costs power and die area to make things fast. A serial port, with a max baud rate of 115200, doesn't need to be as fast as a DDR or serial SPI flash controller. To mitigate this, some software will shadow peripheral registers in RAM to speed up drivers. Generally, vendors don't document APB bus speeds as they use IP from ARM; some ARM document somewhere will tell you. Almost always, your core memory will be very fast, especially TCM on a Cortex-M.

The ARM is a load/store architecture. That means there are specific instructions to load and store between registers and memory; it is not possible to operate directly on memory (some other CPUs, for instance, let you add a constant to a memory value). As a consequence there is usually a pipeline stage for 'load' and 'store', and any memory might insert wait states during that stage. Your compiler and the CPU know this and typically try to get as much performance as possible. This can be a disaster if you are assuming a particular order of accesses to a device.
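
For example, on a Cortex-M you might combine volatile accesses with a data memory barrier when the order of two device writes matters. A rough sketch, with made-up register addresses, assuming a CMSIS device header that provides __DMB():

#include <stdint.h>
#include "stm32f0xx.h"  // assumed CMSIS device header; provides __DMB()

#define DEV_DATA (*(volatile uint32_t *)0x40000000UL)  // hypothetical data register
#define DEV_CTRL (*(volatile uint32_t *)0x40000004UL)  // hypothetical control register

static void start_transfer(uint32_t word)
{
    DEV_DATA = word;  // volatile keeps the compiler from removing or reordering the access
    __DMB();          // barrier: order the data write before the "go" write on the bus
    DEV_CTRL = 1UL;   // start bit must be observed after the data
}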

It is usually faster to implement a register cache if you have driver read and write routines. It is good to wrap register reads and writes in inline functions or defines, as the bus can change in the future. Wrapping reads/writes can also be imperative to ensure the ordering of accesses to a peripheral; volatile by itself may not be enough for memory-mapped I/O. Tomorrow the hardware might change to SPI or something else to conserve pin count. It is easy to add shadowing if you wrapped the accesses.
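
A minimal sketch of that idea, wrapping and shadowing a single register. The register name, its address (0x40004400), and the assumption that the hardware never changes this register on its own are all placeholders; take the real definitions from your device header.

#include <stdint.h>

// Hypothetical peripheral register; use the real definition from your device header.
#define UART_CR  (*(volatile uint32_t *)0x40004400UL)

// Shadow copy kept in fast SRAM; valid only while every write goes through uart_cr_write().
static uint32_t uart_cr_shadow;

static inline void uart_cr_write(uint32_t val)
{
    uart_cr_shadow = val;   // cheap SRAM write
    UART_CR = val;          // slower write across the APB bridge
}

static inline uint32_t uart_cr_read(void)
{
    // Read the shadow instead of crossing the APB bridge.
    // Only safe for registers the hardware never modifies by itself.
    return uart_cr_shadow;
}

If the bus or the register layout changes later, only these two wrappers need to change, and shadowing can be added or removed without touching the rest of the driver.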


STM32 BUS diagram

From the diagram at embedds.com, you can see Flash/RAM on the AHB bus and peripherals on APB. This means peripherals are slower.

Maybe helpful: ARM peripheral address bus architecture

– artless noise (edited by Gabriel Staples)

Do some tests, and report back your results! Grab an oscilloscope and do "oscilloscope profiling". This means using an oscilloscope to measure the time an operation takes. Do this by using direct-register-pin-writes (for speed and consistency) to write a pin HIGH before doing a bunch of register test writes, and LOW after.

How to do oscilloscope profiling of embedded source code

Ex: to write a pin HIGH/LOW

// set pin HIGH (set to 1)
GPIOA_ODR |= 1UL << (uint32_t)pin_index;
// set pin LOW (clear to 0)
GPIOA_ODR &= ~(1UL << (uint32_t)pin_index);

Surround your test code with these:

// set oscilloscope profiling pin HIGH
// do your operations you'd like to time
// set oscilloscope profiling pin LOW

Watch the square wave on the oscilloscope. The width of the high pulse, minus the time of a single pin write, is the time the operation took!

ie: your equations are as follows:

total_time = time_transition_to_LOW - time_transition_to_HIGH - pin_write_time.

To get pin_write_time, which is how long writing a pin HIGH or LOW takes (just one write, not both combined), make a quick application that writes the pin HIGH and then immediately writes it LOW with no delay between the two. Take care to use write techniques that make writing LOW and HIGH take the same number of clock cycles (i.e.: by using the GPIOA_ODR register, as I show above, rather than GPIOA_BSRR or GPIOA_BRR, which, last I checked, take different numbers of clock cycles depending on whether you are writing a pin HIGH or LOW). Now, measure the total time of that on the oscilloscope, and for this test:

pin_write_time = time_transition_to_LOW - time_transition_to_HIGH
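
A tiny sketch of that back-to-back toggle (pin_index assumed defined as above):

// Back-to-back toggle with nothing in between; its pulse width on the scope is pin_write_time.
GPIOA_ODR |= 1UL << (uint32_t)pin_index;     // HIGH
GPIOA_ODR &= ~(1UL << (uint32_t)pin_index);  // LOW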

To write to a specific address in RAM, since you'll need to compare this against register writes, do some fancy pointer manipulation like this below. Let's assume the address you want to write to is 0x20000000. Here's an example of writing val to it.

uint32_t val = 1234567;
*((volatile uint32_t *)0x20000000UL) = val;
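
Putting the pieces together, one possible sketch of a complete test (pin_index, the iteration count, and the test address are assumptions to adapt to your setup); looping many times lets you divide to get an average time per write:

GPIOA_ODR |= 1UL << (uint32_t)pin_index;       // profiling pin HIGH
for (uint32_t i = 0; i < 1000; i++)
{
    *((volatile uint32_t *)0x20000000UL) = i;  // operation under test: RAM write here;
                                               // swap in a peripheral register to compare
}
GPIOA_ODR &= ~(1UL << (uint32_t)pin_index);    // profiling pin LOW
// high-pulse width / 1000 ≈ time per write (after subtracting loop and pin-write overhead)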

Take care not to overwrite actual variables in use in RAM. I won't go into further details, but you can guarantee this by modifying your linker script to reserve certain address spaces, or you can just check and print a few addresses of variables in use in your code and be sure to choose test addresses far away from these so you can have a pretty good idea you aren't overwriting real variables in use.

Note that you could, of course, just use normal code to write to a variable, but doing the above is important so you can test different addresses, as the memory is segmented based on address and different memory segments have different types of buses. Refer to the Memory Map in the datasheet for your chip. Ex: from DS11243 (DocID028294 Rev 6), p102, Figure 22. Memory map (see below), you can see that you have various RAM banks to test:

  • ITCM RAM
  • DTCM RAM
  • SRAM1
  • SRAM2

Note that reading/writing to/from the battery backed-up SRAM (BKPSRAM) and Flash requires special access procedures and/or functions, so the above pointer manipulation won't work by itself. You'll have to follow proper procedures as specified by the Reference Manual for your chip.

[Figure 22: Memory map, from DS11243 p102]

Anyway, do some tests and get back to us. I'm interested in your results.


– Gabriel Staples
  • I don't think it is a good answer to say "do your test" :) I guess @iter is asking on SO specifically to avoid doing something himself that others may have done before or already know. Furthermore, not everyone has an oscilloscope available, and it is not necessary (you could use the SysTick timer or DWT_CYCCNT if it is available on this part). Also, measuring one single C instruction does not make a lot of sense, as its duration depends on many factors: cache, pipeline, etc... – Guillaume Petitjean Aug 20 '19 at 13:32
  • Also, it would make more sense to read/write a bunch of data (1 KB or whatever) to get an average. The way you code this read/write and your compiler settings may also have a huge impact on performance, so you should write some assembly code to avoid that. – Guillaume Petitjean Aug 20 '19 at 13:35

Not necessarily. The details of the answer depend on the implementation (as described in the other answers). In the ARM architecture, a memory read does not care what sits at the other end of the interconnect - be it a peripheral or a memory controller.

In a low performance part, both types of access could be single-cycle. As soon as the clock speeds start to increase, there are implementation details to consider - and it is more likely that at least some memory accesses will be optimised (since instruction fetch is the primary limit of execution performance).

With a high-performance, real-time-optimised implementation, you can probably find some memory accesses which are much slower than the latency-optimised peripheral accesses - but there will still be some 'tightly coupled memory' which will perform as fast as or faster than these peripherals.

– Sean Houlihane