
I'm in the process of coding a reusable C++ module for an ARM Cortex-M4 processor. The module uses a lot of storage to accomplish its task and it's time-critical.

To allow users of my module to customize its behavior, I'm using different backend classes for the different low-level task implementations. One of these backends is a storage backend, which is meant to store the actual data in different types of volatile/non-volatile RAM. It consists mostly of set/get functions that are very fast to execute and will be called very frequently. They are mostly of this form:

uint8_t StorageBackend::getValueFromTable(int row, int column, int parameterID) 
{
    return table[row][column].parameters[parameterID];
}

uint8_t StorageBackend::getNumParameters() { return kNumParameters; }

The underlying tables and arrays have sizes and data types that depend on the user-defined functionality, so there is no way for me to avoid using a storage backend. One primary concern is the need to put the actual data into a certain section of the RAM address space (e.g. to use an external RAM), and I don't want to limit my module to a specific storage option.

Now I'm wondering what design pattern to choose for separating the storage aspects from my main module.

  1. A class with virtual functions would be a simple yet powerful option. However, I fear the cost of calling virtual set/get functions very often in a time-critical environment. For a storage backend in particular this could be a serious problem.
  2. Supplying the module's main class with template parameters for its different backends (maybe even with the CRTP pattern?). This would avoid virtual functions and would even allow inlining the set/get functions of the storage backend (a minimal sketch follows this list). However, it would require the whole main class to be implemented in the header file, which isn't particularly tidy...
  3. Use simple C-style functions to form the storage backend.
  4. Use macros for the simple set/get functions (after compilation this should roughly be the same as option 2 with all set/get-functions inlined.)
  5. Define the storage data structures myself and allow customization by using macros as the data types, e.g. `RAM_UINT8 table[ROWSIZE][COLSIZE]` with the user adding `#define RAM_UINT8 __attribute__((section("EXTRAM"))) uint8_t`. The downside of this is that it requires all data to sit in the same contiguous section of RAM, which is not always possible on an embedded target.
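
To make option 2 concrete, here is a minimal sketch of the plain template-parameter variant (all names and sizes here are made up for illustration, not part of my actual module):

    #include <cstdint>

    // Hypothetical example backend; names and sizes are illustrative only.
    template <int Rows, int Cols, int NumParams>
    class InternalRamBackend {
    public:
        uint8_t getValueFromTable(int row, int column, int parameterID) const {
            return table[row][column].parameters[parameterID];
        }
        static constexpr int getNumParameters() { return NumParams; }

    private:
        struct Entry { uint8_t parameters[NumParams]; };
        Entry table[Rows][Cols];
    };

    // The main class takes the backend as a template parameter, so every
    // backend call can be inlined - at the cost of living in a header.
    template <typename Storage>
    class MyModule {
    public:
        explicit MyModule(Storage& storage) : storage(storage) {}

        uint8_t sumParameters(int row, int column) const {
            uint8_t sum = 0;
            for (int p = 0; p < Storage::getNumParameters(); ++p)
                sum += storage.getValueFromTable(row, column, p);
            return sum;
        }

    private:
        Storage& storage;
    };

Usage would be along the lines of `InternalRamBackend<8, 8, 4> ram; MyModule<InternalRamBackend<8, 8, 4>> module(ram);`.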

I wonder if there are more options? Right now I'm leaning towards option 4, as it's tidy enough yet has zero influence on the actual run-time performance.
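
To make option 4 concrete as well, a rough sketch (all names, sizes and the section name are made up):

    #include <cstdint>

    // Hypothetical user-provided storage; the section name is made up and
    // target-specific.
    struct Entry { uint8_t parameters[4]; };
    static Entry g_table[8][8] __attribute__((section(".extram")));

    // The module touches storage only through these macros, so the user can
    // redirect them to any memory or accessor without function-call overhead:
    #define STORAGE_GET(row, col, param) \
        (g_table[(row)][(col)].parameters[(param)])
    #define STORAGE_SET(row, col, param, val) \
        ((void)(g_table[(row)][(col)].parameters[(param)] = (uint8_t)(val)))

The usual macro downsides apply (no type checking, no scoping), which is part of why I'm still weighing it against option 2.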

To sum it up: What's the best way to implement a low/zero-overhead storage abstraction layer on a Cortex-M4?

  • Cost of virtual functions is often exaggerated. Pick the simplest method and **do a benchmark**. – user694733 Feb 23 '16 at 10:23
  • Embedded C++ has been abandoned for some time now. There is no official embedded version of the C++ language. – too honest for this site Feb 23 '16 at 12:53
  • Actually, all I wanted to say is: C++ on an embedded target. I wasn't aware that there had been an embedded version of the language. I changed the title. – Johannes Neumann Feb 23 '16 at 14:07
  • `table[row][column].parameters[parameterID]` - That's quite some overhead to retrieve a single byte. And probably not even necessary/beneficial: if your "storage" is accessible via normal pointer dereferencing, why not provide some `malloc`-type of function and let the user acquire a pointer to the storage and operate on it at his/her discretion? – JimmyB Jul 17 '18 at 14:39
  • I would, btw, recommend looking into template parameters (current personal taste), for the efficiency of compile-time polymorphism &c. – JimmyB Jul 17 '18 at 14:41

2 Answers


A virtual member call generally boils down to a single extra lookup (if that). The vtable (a common implementation technique for virtual functions) is normally easily reachable from the `this` pointer, using instructions no larger than those normally needed to load the known, fixed address of a statically linked function.

Given that you're already doing

base + row*rowStride + column*entrySize + parameterOffset + parameterID

(assuming you haven't overloaded any operators) and you're calling a function that's getting passed 3 parameters (which all need to be loaded), that's a pretty small bit of overhead, if any.

But that's not to say the overhead of calling a function won't burn you if you're doing lots and lots of accesses. The answer to that, though, is to allow the caller to retrieve multiple values at a time, as sketched below.
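
For example (a hypothetical sketch; the names and sizes are made up, not necessarily your actual interface), the backend could expose a bulk getter next to the single-value one, so one virtual call fetches a whole parameter block:

    #include <cstdint>
    #include <cstring>

    // Hypothetical interface: the batch getter amortizes the virtual-call
    // cost over many values.
    class IStorageBackend {
    public:
        virtual ~IStorageBackend() = default;
        virtual uint8_t getValueFromTable(int row, int column, int parameterID) = 0;
        // Copies all parameters of one entry into 'dest' with a single call.
        virtual void getParameters(int row, int column, uint8_t* dest) = 0;
    };

    class RamBackend : public IStorageBackend {
    public:
        uint8_t getValueFromTable(int row, int column, int parameterID) override {
            return table[row][column].parameters[parameterID];
        }
        void getParameters(int row, int column, uint8_t* dest) override {
            std::memcpy(dest, table[row][column].parameters, kNumParameters);
        }
    private:
        static constexpr int kNumParameters = 4;  // illustrative size
        struct Entry { uint8_t parameters[kNumParameters]; };
        Entry table[8][8];                        // illustrative size
    };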

Russ Schultz

In my experience, language features rarely solve concrete problems by themselves. They can improve the maintainability, readability and modularity of code, and make it more elegant, sometimes even more efficient; but personally I wouldn't rely too much on language features or the compiler, especially on a microcontroller.

So, personally, I would tend towards solutions similar to those listed as 3/4/5 above. I would avoid overly complicated template and OOP patterns (at first) and instead try to find the actual bottleneck of a "table module" like this by testing and measuring its real performance, get more control over the actual memory layout and access operations, and try to keep it simple. :)

Not sure if this solves your problem, but here are some general thoughts on the topic:

  • Flat structure: Instead of using a multi-dimensional array, you could use a flat memory structure. That way the access to individual entries can be optimized for speed more easily, and you have full control over the data layout. Even more so, if all data elements are of fixed, equal size.

  • Fixed, power-of-two sizes: In order to speed things up, you could use 2^n-sized table entries, which probably results in faster access because bit-shifts/bitwise operations can replace multiplications (row and entry sizes that are a power-of-two number of entries/bytes, e.g. a table entry size of 256 bytes with 64 32-bit elements). Assuming your application allows it, you could round the size of the table entries up to the next power of two and leave some bytes unused - speed vs. size.

With a fixed, power-of-two-sized table, the array access can be written explicitly as pointer arithmetic, so that the code resembles more closely what the processor actually has to do. This is only worth considering in performance-critical parts (and is more a matter of taste - the compiler will probably generate the same thing when array notation is used):

    // Equivalent to: return table[row][column].parameters[parameterID];

    // With power-of-two strides, the multiplications become bit-shifts:
    //   ((row) * ROW_SIZE)   ->  ((row) << SHIFT_ROW_SIZE)
    //   ((col) * ENTRY_SIZE) ->  ((col) << SHIFT_ENTRY_SIZE)
    #define ROW(row) ((row) << SHIFT_ROW_SIZE)    // SHIFT_ROW_SIZE = log2(bytes per row)
    #define COL(col) ((col) << SHIFT_ENTRY_SIZE)  // SHIFT_ENTRY_SIZE = log2(bytes per entry)

    // One-byte parameters need no further scaling:
    const uint8_t *p = table + ROW(row) + COL(column) + parameterID;
    // A pointer could also be returned here, so the caller can batch accesses.
    return *p;

This only works when the table dimensions are known at compile time, so you may need a more dynamic solution and recalculate the increments/numbers of bit-shifts whenever the size of the table changes. Maybe the table entry and parameter sizes can be fixed, so that only the row/column counts have to be configurable at run time? A sketch of that idea follows.
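
A hypothetical sketch of that idea - the entry size stays a compile-time power of two, while the column count (and thus the row shift) is configured at run time (all names are made up):

    #include <cstdint>

    // Hypothetical sketch: entry size fixed at compile time, column count
    // supplied (as a power of two) at run time.
    class FlatTable {
    public:
        static constexpr unsigned kEntryShift = 2;  // 4 bytes per entry

        // colShift = log2(number of columns)
        FlatTable(uint8_t* base, unsigned colShift)
            : base(base), rowShift(colShift + kEntryShift) {}

        uint8_t get(int row, int column, int parameterID) const {
            return base[((unsigned)row << rowShift)
                        + ((unsigned)column << kEntryShift)
                        + (unsigned)parameterID];
        }

    private:
        uint8_t* base;      // user-supplied storage, e.g. in external RAM
        unsigned rowShift;  // recomputed whenever the table is resized
    };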

  • Inlining: Declaring the set/get functions inline (or defining them in the header) might help to reduce the function call overhead.

  • Batch: Doing multiple accesses in a sequence is probably more efficient than accessing individual entries one by one. You can use pointer arithmetic to do this (see the sketch after this list).

  • Memory alignment: Align all entries to 4-byte words and make the entries no smaller than 4 bytes. In my experience, this helps the STM32 with memory access.

  • DMA: Use of memory-to-memory DMA might help a lot with speed.

  • FMC peripheral of the STM32F4x: If you use external SDRAM, things can be tweaked by using different timing parameters (FMC). There might be useful bits of code in the HAL_SDRAM_*() functions provided by ST.

  • Cache: Since the Cortex-M4 does not have a data/instruction cache (AFAIK), all the magic cache voodoo can be safely ignored. :)

  • (Data structure: Depending on the nature of your data and access methods, a different data structure might be useful. If the table might be resized at runtime, and if random access isn't that important, linked lists might be interesting instead. Or hash tables might be worth a look.)
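
As mentioned in the batch point above, a sequential read over a flat table can then be a plain pointer walk (a hypothetical sketch; the flat layout and names are assumptions):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical batch read: compute the row's base address once, then walk.
    static void readRow(const uint8_t* table, size_t rowStride, int row,
                        uint8_t* dest, size_t count) {
        const uint8_t* src = table + (size_t)row * rowStride;
        for (size_t i = 0; i < count; ++i)
            dest[i] = src[i];  // compilers typically turn this into word copies
    }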

rel