How could placing the stack in TCM reduce processor performance?

Question

I'm measuring ARM Cortex-R5F processor performance by running the CoreMark benchmark under different scenarios. One scenario places the stack in ATCM memory.

When compiling without the inline flag, the stack in TCM gives better results; when compiling with the inline flag, the stack in RAM gives better results.

How could this be explained, given that TCM is faster and closer to the processor?

There is no stack overflow in my program when the stack is in TCM.

Will the TCM be used for data if you don't put the stack there? With `inline`, the compiler will have reduced stack usage/spills, so stack use will not be as performance critical. TCM is usually as fast as (or faster than) even L1 cache. L1 may have synchronization issues, whereas TCM is dedicated per CPU. – artless noise, Jul 30 '15 at 13:24
@artlessnoise Why would the compiler reduce stack usage when compiling with `inline`? With `inline`, the body of the called function is copied into the calling function, so the local data of the called function becomes local data of the calling function. Isn't the data on the stack the same? – bouqbouq, Jul 30 '15 at 13:31
Each time you call a function, the compiler emits a *stack frame*. When you `inline`, the compiler may get rid of the stack frames. If the function is particularly small, the *stack frame* overhead may be relatively bigger. The compiler can use **registers** instead of the stack frame with `inline`; without it, the compiler must adhere to the EABI, use the predefined registers, and spill all the time. Stack use with/without `inline` will vary, but in theory it can be reduced. It depends on the code, of course. – artless noise, Jul 30 '15 at 14:02
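
To make the inlining point concrete, here is a minimal sketch (not taken from CoreMark) of the same loop written with a real call and with a forced inline. The function names are made up for illustration, and `noinline` / `always_inline` are GCC/Clang attributes used here only to pin down the compiler's choice. Both versions compute the same value; what may differ is how much the compiler has to save and restore around the call.

```c
#include <stdint.h>

/* Hypothetical small helper, forced to stay a real function call. */
__attribute__((noinline))
uint32_t mix_call(uint32_t acc, uint32_t x)
{
    return (acc << 1) ^ x;
}

/* Same helper, forced to be inlined into its caller. */
static inline __attribute__((always_inline))
uint32_t mix_inline(uint32_t acc, uint32_t x)
{
    return (acc << 1) ^ x;
}

/* Each iteration performs a call that follows the procedure call standard:
   arguments go in the argument registers, and values that must survive the
   call have to live in callee-saved registers or on the stack. */
uint32_t fold_with_calls(const uint32_t *data, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc = mix_call(acc, data[i]);
    }
    return acc;
}

/* No call in the loop body, so the compiler is free to keep acc and i
   in whatever registers it likes for the whole loop. */
uint32_t fold_inlined(const uint32_t *data, uint32_t n)
{
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; i++) {
        acc = mix_inline(acc, data[i]);
    }
    return acc;
}
```

Whether the stack traffic really changes depends on the compiler, the optimization flags, and the code, so comparing the generated assembly of the two functions (for example with `gcc -O2 -S`) is the reliable way to check.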

score 1 · Accepted Answer · answered Jul 29 '15 at 10:10

> How could this be explained, given that TCM is faster and closer to the processor?

Is your TCM faster than the L1 data cache? It isn't always (many designs have single cycle L1 D cache, but two cycle access to TCM).

The usual purpose of TCM is not performance (although it is nice), but predictability - you can't get cache misses in TCM so real-time systems use it for timing critical code and data sections.
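
As a rough sketch of what "putting timing-critical data in TCM" looks like from C with a GCC-style toolchain: the section name `.atcm_data` and the helper names below are hypothetical, and the attribute only has an effect if the linker script maps that section to the ATCM address range.

```c
#include <stddef.h>
#include <stdint.h>

/* ".atcm_data" is an assumed section name, not something defined by the
   toolchain; the linker script must place it in the ATCM region. */
#define IN_ATCM __attribute__((section(".atcm_data")))

/* Timing-critical buffer placed in the (assumed) TCM section. */
static volatile uint32_t sample_buf[64] IN_ATCM;

/* Store one sample; the index wraps so the access stays inside the buffer. */
void store_sample(size_t i, uint32_t value)
{
    sample_buf[i % 64] = value;
}

/* Read one sample back from the same buffer. */
uint32_t load_sample(size_t i)
{
    return sample_buf[i % 64];
}
```

The stack itself is normally placed by the linker script or startup code rather than by an attribute like this, so this only illustrates the data-section case.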

– solidpixel, Jul 29 '15 at 10:10

I still don't get it. I put the stack in TCM rather than in RAM, so what should be affected is the stack access time, while the L1 memory would be used the same way as when the stack is in RAM. Am I mistaken? – bouqbouq Jul 29 '15 at 10:17
@MakhloufGharbi Accesses to the TCM go there _instead of_ going to the L1 cache. If you had 1-cycle L1 and 2-cycle TCM as per the example, then any TCM access _always_ takes two cycles, whereas a RAM access will either take 1 cycle if you're lucky or tens to hundreds if you're not. If the access patterns of your code mean a RAM stack tends to stay hot in the cache, you're likely to be lucky _most_ of the time, but there's still no guarantee... – Notlikethat Jul 29 '15 at 12:29
@Notlikethat As I understand it, when the stack is in RAM, the processor fetches the stack data from RAM into the L1 cache and works with it there. But when the stack is in TCM, the data always stays in TCM (there is no fetch from TCM into the L1 cache)? Or am I mistaken? – bouqbouq Jul 29 '15 at 12:37
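
The trade-off discussed in these comments can be put into numbers with a small model. This is only a sketch: the 1-cycle hit and 2-cycle TCM figures come from the example above, while the miss penalty is something you would have to assume or measure for your own system, and the function names are made up.

```c
/* Average cycles per access for a cached RAM stack: hits cost hit_cycles,
   misses cost miss_cycles. hit_rate is a fraction in [0, 1]. */
double avg_cached_cycles(double hit_rate, double hit_cycles, double miss_cycles)
{
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

/* Hit rate at which the cached RAM stack and the TCM stack cost the same
   on average. Above this rate the cached stack wins on average; below it
   the TCM stack wins. Assumes miss_cycles > hit_cycles. */
double break_even_hit_rate(double tcm_cycles, double hit_cycles, double miss_cycles)
{
    return (miss_cycles - tcm_cycles) / (miss_cycles - hit_cycles);
}
```

For example, with a 1-cycle hit, a 2-cycle TCM access, and an assumed 50-cycle miss, `break_even_hit_rate(2.0, 1.0, 50.0)` is 48/49, so the stack would need to hit in the cache roughly 98% of the time before RAM beats TCM on average. The TCM figure is constant, though, which is the predictability argument from the answer.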
