I have seen code snippets containing `#pragma ghs section bss=".mysection"`. I've heard that these kinds of custom code sections can improve execution/performance in some way. But I don't understand exactly how this works internally, or how it contributes to execution/performance. Thanks in advance.

- The first answer addresses the specifics of what `#pragma ghs section bss=".mysection"` does. How `pragma` statements work in general is discussed [here](https://gcc.gnu.org/onlinedocs/cpp/Pragmas.html), [here](https://www.educba.com/hash-pragma-in-c/), and [here](https://www.geeksforgeeks.org/pragma-directive-in-c-c/). – ryyker Jul 12 '22 at 13:23
- The answer is mostly “it depends” — on the platform and the toolchain. There is no portable, general answer. – Jonathan Leffler Jul 12 '22 at 13:28
- What language, what system, what compiler? – Lundin Jul 12 '22 at 13:52
- Let's assume it's the Green Hills (GHS) compiler for a 32-bit microcontroller, with C++, @Lundin. However, I'd like to know the generic behaviour of these custom code sections. – Sudhakar M Jul 13 '22 at 03:52
- @SudhakarM That's still not very helpful. "32 bit microcontroller" could be anything from an MC68k to a high-end PowerPC. Depending on which target we are talking about, it may or may not have cache, may or may not be a Harvard architecture, etc. – Lundin Jul 13 '22 at 06:10
- NXP PowerPC MPC5xxx series, @Lundin. – Sudhakar M Jul 13 '22 at 07:23
- @SudhakarM Then one of the reasons for using separate segments is that this ISA uses so-called "VLE" (variable-length encoding) instructions, and it can execute both those and standard PowerPC ("Book E") instructions - as far as I remember, it can use both in the same program. So it has nothing to do with _anything_ mentioned in the 4 present guesswork answers... and this is why we shouldn't post answers before questions can be answered... – Lundin Jul 13 '22 at 09:51
- @Lundin, thank you. But what's the use of it? Maybe I didn't understand you exactly. – Sudhakar M Jul 15 '22 at 03:41
- @SudhakarM There's plenty of documentation about it from NXP and others. Basically, it allows for both 16- and 32-bit instructions. – Lundin Jul 15 '22 at 06:19
- NXP has processors with multiple cores, e.g. e200z4, e200z7. Each core can have a DTCM RAM (usually up to 64 kB) close to it. If you plan to put certain data there, instead of in L3 memory, you would use such a pragma, e.g. for task/ISR stacks, or other data the core needs "fast". The same goes if you want to use some L3 memory as a reset-safe area, which is not cleared by your startup code. Or L3 memory split up into shared memory accessed by all cores (incl. some cache sync / spinlock) vs. data areas used solely by a single core. VLE is not really the reason for a `.bss` section mapping. – kesselhaus Jul 16 '22 at 10:50
5 Answers
This is a micro-optimization which is rarely beneficial.

In short, caches benefit from "locality of reference": data that is used together should be stored together, for faster access, and the same goes for code. You can override the compiler's choice by putting some functions together in a single code segment. If you guess better than the compiler, you could see a performance increase; but if your guess is worse, the performance will decrease. And compiler vendors generally have more experience than you do.

- *And compiler vendors generally have more experience than you do.* - But do any compilers even try to group data based on optimization decisions, rather than simple declaration order (forward or reverse) or whatever traversal order their internal data structures give? With profile-guided optimization, I think GCC can put the machine code of some functions into a `.text.cold` section, but I haven't noticed it grouping data/bss. – Peter Cordes Jul 12 '22 at 13:30
- @PeterCordes: Apparently VS2010 [stopped using this naive order](https://stackoverflow.com/questions/12102613/visual-studio-preserve-code-order-boundaries-with-compiled-code), which managed to surprise some people back then. It might depend on the "link-time code generation (LTCG)" feature though, since that gives the linker a better call graph. – MSalters Jul 12 '22 at 13:46
- Why are you assuming the OP is using a computer with cache memory? There isn't enough info to answer the question. For all we know they might be using an antique 8-bitter and then the reason for custom segments could be to utilize 8 bit addressing as another form of micro-optimization. – Lundin Jul 12 '22 at 13:54
- @Lundin: Antique 8-bitters don't run C++. Even low-end ARM chips commonly have 16 kB L1 cache. Sure, that's tiny compared to the megabytes of cache on desktop/server x86, but cache also has a power benefit. Fetching instructions from memory takes inter-chip communication and will wake DRAM from sleep mode. – MSalters Jul 12 '22 at 14:05
- @MSalters: ARM [Cortex-M microcontrollers](https://en.wikipedia.org/wiki/ARM_Cortex-M#Silicon_customization) often have no cache; it's not available on most versions, and optional on M7 and M35P, also "tightly-coupled memory" (lower latency RAM). Obviously if a CPU doesn't have any cache or special faster memory regions, this answer wouldn't be relevant, but then you probably wouldn't need to use sections to group data. Perhaps you meant "low end" relative to CPUs you'd use in a phone, a step up from a microcontroller. – Peter Cordes Jul 12 '22 at 14:08
- @MSalters "Antique 8-bitters don't run C++." Ever heard of Arduino? As for ARM, Cortex M0 to M4 don't have cache. Cortex M7 is not "low end". – Lundin Jul 12 '22 at 14:09
- Since we are given no information about the target and its memory map, it is not possible to say that it is an unnecessary micro-optimisation. Many MCUs have different memory types and bus architectures or external memory, each having different access timing. – Clifford Jul 14 '22 at 06:47
-
Here are some examples of why I've used custom sections on my embedded microcontroller projects. It's not always for improved execution/performance.
- Microcontrollers often execute code from flash, and sometimes access to flash is slower than RAM, so I have created a custom section in RAM to contain a copy of a portion of code (such as an interrupt handler) that I want to execute as fast as possible.
- Flash can be erased and reprogrammed by the microcontroller but sometimes the microcontroller cannot execute code from the flash while the flash is being erased/reprogrammed. So I create a custom section to contain a RAM copy of the code that performs the flash erasing/reprogramming.
- Sometimes I want to store configuration or log data in nonvolatile flash memory. I want the data located at a fixed known location so that subsequent application revisions always know where to read the data from. I create a custom section to contain the configuration/log data at a known fixed location.
- Sometimes I want to store some data at a known fixed location so that the data can be accessed by multiple programs such as the bootloader and the main application. I'll create a custom section to contain the shared data at a fixed location.
- Sometimes microcontroller peripherals such as a DMA controller have strict alignment requirements, such as the buffer must be 256-byte aligned or whatever. Or maybe the microcontroller has multiple RAM regions with different performance levels or peripheral accessibility. I'll create a custom section to locate the DMA buffer in the desired RAM region with the required alignment.
There are probably more examples that I'm not thinking of. Basically anytime I want to control the location of some code or data, and I don't want to leave it up to the whim of the linker, then I create a custom section to force the code/data into the location that I want.

Here are some cases for using sections different from the defaults:
- Locating memory in a different area: some systems have multiple RAM areas, like SRAM, DDRAM, DTCM, ... By choosing a section manually, we can select a different area for a variable (depending on its size, attributes, ...).
- Memory protection: if you want to separate your source into multiple parts with different security levels, using sections makes it possible to group the memory and configure the MPU/MMU accordingly.
- Sharing with a different controller which requires a specific address for a set of variables.
- Different attributes during initialization: for example, if you need a very big variable and don't want to waste time initializing it (if the variable size is > 1 MB, that takes time in an embedded system).

Using sectioning pragmas is not the mere micro-optimization that @MSalters' answer makes it out to be. Also, very few embedded systems (if any, even SoCs) have DRAM; they usually have SRAM. And as some comments already stated, not all ARM cores even have cache.
Here are some more reasons for using section pragmas:
In safety-critical embedded systems, you want to split the ASIL A/B/C/D and QM code and data into different sections and apply certain access restrictions to these memory areas via the Core & System MPUs.
You might have certain specially aligned memory areas due to DMA and peripherals. Maybe certain peripherals even require specific data placement themselves; e.g. a HW accelerator might be able to access only specific memories.
If you have a multicore system whose cores do not share the same architecture, e.g. a TI SoC with ARM Cortex-R5F + ARM Cortex-M4 + C66x DSP, you have multiple binaries (separately compiled/linked/located, even with different compilers), but all of them need to access the same shared memories for data exchange, while keeping core-local data separate from each other. You might also separate their code into the different available memory banks.
A security HSM subsystem might have code and data, but also special SecureRAM for the keys.
You might have several different kinds of memories in your system, which can be configured to be used in certain ways, e.g. L1P/L1D, L1 TCMA/TCMB, L2 RAM, L3 RAM, where even L1 and L2 could be configured partially as cache or as scratchpad RAM, e.g. to keep interrupt vector tables, OS task/ISR stacks, and ISR code closer to the core, instead of in L3 (even with caching).
You might want to use certain reset-safe areas to keep data over a reset, which will not be cleared/initialized by startup code.
You might have a flash bootloader and an application which need to be placed in different flash sections. And you might also want to exchange data between them.
You might want to split the implementation from the configuration and parameters. The config data might be placed in separate flash sections that can be updated separately, without flashing the whole application.
- Allows the production line to create and flash the ECU-specific calibration set separately.
- Download a new parameter set, depending on the platform, e.g.:
- region code table of different vehicles
- sensor mounting parameters depending on vehicle chassis differences
- region specific function parameters
A hobby project might not need all of it, but in commercial development, all of the above could be reasons to use different sections besides the standard sections.

- I disagree that it's valid practice to label certain sections of the program safety-critical; that's just not true and dangerously naive. Either the whole program is critical or it is not. Also, introducing the _extra complexity_ in a safety-related application just to split the program into different parts is already a safety hazard in itself. Keep it simple == keep it safe. – Lundin Jul 13 '22 at 06:20
- Your criticism of MSalters' answer is perhaps valid, but belongs in the comment section of that answer. Your answer should stand on its own merit rather than as a counter to some other answer. If the answer referred to is deleted, this will no longer make much sense. – Clifford Jul 14 '22 at 07:12
- @Lundin Whether certain code is ASIL or not depends on the functionality. In automotive, a BSD/LCA function is QM (it gives out a warning), but a function like EBA or RCTB, which controls the torque/brake, is ASIL. If you run them on the same core, you want to split them so they don't interfere. Getting the whole SW ASIL-qualified may be desirable, but is not always possible. – kesselhaus Jul 16 '22 at 10:56
- @kesselhaus My point is that you _can't_ split them, because any part of a program can take down the whole MCU, by causing a hardfault exception due to bugs or just going runaway in general. Anyone who claims this can't happen and that it isn't a hazard hasn't worked with microcontrollers. – Lundin Jul 16 '22 at 17:15
Whether or not such location directives:

> will improve execution/performance in a way

is specific to the particular hardware and application. The linker script defines memory regions/sections, and inspecting it will reveal exactly where the specified section will be located, and may therefore reveal why this was done.
For example an STM32F7xx (ARM Cortex-M7) may have memory on different buses - internally it has SRAM1, SRAM2, DTCM, and additionally a specific board may have external SRAM or QSPI RAM - each having different speed and timing.
The typical configuration of a default linker script is to allow the linker to locate any data in any addressable region specified in the link map. However, in a complex memory environment such as that described above, the behaviour and performance of the code will vary considerably depending on where data objects are located. For time-critical code especially, that would be undesirable: unrelated code changes could cause time-critical data objects to be relocated to slower memory.
In other cases, such as SRAM1 and SRAM2 in the above example, the performance may be identical; but because they are on separate buses, using the smaller SRAM2 region for DMA operations minimises bus contention between DMA bus masters and the CPU, which can also improve performance.
