malloc, free, new, delete
There are various techniques to mitigate issues with dynamic memory allocation. For instance, you can construct all objects with the placement new operator in pre-reserved storage, or allocate everything at once from the heap during startup. However, when dynamic memory allocation at run time is unavoidable, specialized memory allocator implementations can be employed.
Real Time Heap Allocator
There are various heap allocation algorithms used across platforms, such as Dlmalloc, Phkmalloc, ptmalloc, jemalloc, Google Chrome's PartitionAlloc, and the glibc heap allocator. While each has its benefits, they aren't tailored for hard real-time environments prioritizing speed, determinism, minimal fragmentation, and memory safety.
The main requirements for a real-time heap allocator are:
Predictable Execution Time: The worst-case execution time of the malloc and free functions, and of the C++ new and delete operators, must be deterministic and independent of application data.
Memory Pool Preservation: The algorithm must strive to minimize the likelihood of exhausting the memory pool. This can be achieved by reducing fragmentation and minimizing memory waste.
Fragmentation Management: The algorithms should effectively manage and reduce external fragmentation, which can limit the amount of available free memory.
Defined Behavior: The allocator must aim to eliminate any undefined behavior to ensure consistency and reliability in its operations.
Functional Safety: The allocator must adhere to the principles of functional safety. It should consistently perform its intended function during normal and abnormal conditions. Its design must consider and mitigate possible failure modes, errors, and faults.
When we talk about 'functional safety' in RTSHA, we are not referring to 'security'. "Functional safety" refers to the aspect of a system's design that ensures it operates correctly in response to its inputs and failures, minimizing the risk of physical harm, while "security" refers to the measures taken to protect a system from unauthorized access, disruption, or damage.
Error Detection and Handling: The allocator should have mechanisms to detect and handle memory allocation errors or failures. This can include robust error reporting, and fallback or recovery strategies in case of allocation failures.
Support for Different Algorithms: The allocator should be flexible enough to support different memory allocation algorithms, allowing it to be adapted to the specific needs of different applications.
Configurability: The allocator should be configurable to suit the requirements of specific platforms and applications. This includes adjusting parameters like the size of the memory pool, the size of allocation blocks, and the allocation strategy.
Efficiency: The allocator should be efficient, in terms of both time and space. It should aim for minimal overhead and quick allocation and deallocation times.
Readability and Maintainability: The code for the allocator should be clear, well-documented, and easy to maintain. This includes adhering to good coding practices, such as using meaningful variable names and including comments that explain the code.
Compatibility: The allocator should be compatible with the system it is designed for and work well with other components of the system.
The Real Time Safety Heap Allocator (RTSHA) I authored, available on GitHub, is an ultra-fast memory management system designed to meet those requirements.
RTSHA supports several different heap allocation algorithms:
Small Fix Memory Pages
This algorithm is an approach to memory management that is often used in situations where objects of a certain size are frequently allocated and deallocated. Using a 'fixed chunk size' algorithm greatly simplifies the memory allocation process and reduces fragmentation.
The memory is divided into pages of chunks (blocks) of a fixed size (32, 64, 128, 256 and 512 bytes). When an allocation request comes in, it can simply be given one of these blocks. This means that the allocator doesn't have to search through the heap to find a block of the right size, which can improve performance. The memory of the free blocks themselves is used as 'free list' storage. The list is implemented using a standard linked list. However, by enabling the precompiler option USE_STL_LIST, the STL version of the forward list can also be utilized. There isn't a significant performance difference between the two implementations.
Deallocations are also straightforward, as the block is added back to the list of available chunks. There's no need to merge adjacent free blocks, as there is with some other allocation strategies, which can also improve performance.
However, fixed chunk size allocation is not a good fit for all scenarios. It works best when the majority of allocations are of the same size, or of a small number of different sizes. If allocation requests are of widely varying sizes, this approach can lead to a lot of wasted memory, as small allocations take up an entire chunk and large allocations require multiple chunks.
The Small Fix Memory Page algorithm is also used internally by the "Power Two Memory Page" and "Big Memory Page" algorithms.
Power Two Memory Pages
This algorithm is a more intricate system that exclusively allows blocks of sizes that are powers of two. This design makes merging free blocks back together easier and significantly reduces fragmentation. The core of the algorithm is an array of free lists, one per power-of-two size class: there is a dedicated list for 64-byte free blocks, another for 128-byte blocks, and so on. This structured approach ensures that blocks of a specific size are readily available, making allocation and deallocation efficient.

By combining power-of-two block sizes with the array of free lists and a binary search mechanism, the algorithm strikes a balance between memory efficiency and operational speed, which is particularly useful for systems where fragmentation is an important concern. The algorithm divides memory into partitions to minimize fragmentation, and a 'Best Fit' search finds the smallest block on the page that is large enough to satisfy the allocation.

Furthermore, this system is resistant to breakdowns due to its algorithmic approach to allocating and deallocating memory. The coalescing operation helps ensure that large contiguous blocks of memory can be reformed after they are freed, reducing the likelihood of fragmentation over time. Coalescing relies on having free blocks of the same size available, which is not always the case, so this system does not completely eliminate fragmentation but rather aims to minimize it.
Measured Performance on Cortex-M7
Based on the results obtained from the system's profiling, here are the performance metrics in terms of CPU cycles for the memory operations:
Small Fix Page:
rtsha_malloc: 204 cycles
rtsha_free: 193 cycles
This represents the time taken for memory allocation and deallocation with the small fixed-size pages, which are specifically designed to handle memory chunks of less than 512 bytes.
Power2 Page:
rtsha_malloc: 873 cycles
rtsha_free: 636 cycles