13

I have a function reading from some volatile memory which is updated by a DMA. The DMA never operates on the same memory location as the function. My application is performance critical, and I realized that the execution time improves by approx. 20% if I do not declare the memory as volatile. Within the scope of my function the memory is non-volatile. However, I have to be sure that the next time the function is called, the compiler knows that the memory may have changed.

The memory is two two-dimensional arrays:

volatile uint16_t memoryBuffer[2][10][20] = {0};

The DMA operates on the opposite "matrix" from the program function:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory (readings only):
      foo(memoryBuffer[indexOppositeOfDMA][n][m]);
    }
  }
}

Is there a proper way to tell my compiler that memoryBuffer is non-volatile inside the scope of myTask() but may be changed the next time I call myTask(), so I could obtain the performance improvement of 20%?

Platform: Cortex-M4

Hakan Fıstık
  • Excellent question IMHO. – owacoder Feb 09 '17 at 13:45
  • How can the execution time be improved just by removing volatile? I guess the behavior is different (does not work) when the compiler/optimizer is allowed to get rid of that matrix... – LPs Feb 09 '17 at 13:45
  • @LPs - It can improve if the compiler sees the variable isn't written to between reads and just optimizes the reads away... It's not as if the program has to be correct, right? :) – StoryTeller - Unslander Monica Feb 09 '17 at 13:51
  • @StoryTeller Yes, what I meant is: without volatile the code is probably faster because optimization removes the double loop, which is not what the OP wants, I guess – LPs Feb 09 '17 at 13:53
  • @LPs - While reading consecutive memory from RAM into core registers you will reduce your reading overhead. By declaring the memory volatile, the compiler will not allow that, as it assumes the memory could change in the meantime... I guess – Dennis Kirkegaard Feb 09 '17 at 13:54
  • The code snippet is a simplified version. The true code operates perfectly both with and without the volatile declaration. The difference is the execution time. – Dennis Kirkegaard Feb 09 '17 at 13:58
  • Well, maybe it is a cache matter. If so, you could try to invalidate the cache manually, but I'm not an expert on this matter. You can look at the assembly to understand what is going on. – LPs Feb 09 '17 at 13:59
  • Is your compiler actually optimizing that code away (without the volatile)? Being global, maybe it isn't. – old_timer Feb 09 '17 at 14:55
  • `volatile` means the content may change at _any_ time. Is some mechanism in place that guarantees the `memoryBuffer[indexOppositeOfDMA]` portion does not change during the run of `myTask()`? – chux - Reinstate Monica Feb 09 '17 at 15:14
  • Does `Do some stuff with memory:` imply code might _write_? Or is that just _reading_? – chux - Reinstate Monica Feb 09 '17 at 15:18
  • Stepping back, I _think_ the best approach may be not to use `volatile`. If accesses to the halves of `memoryBuffer[2]` are known to only be done during non-volatile times, the keyword is moot. Seeing the larger application would help. IOW, good question, but the best answer depends on unposted additional data, like how other code affects `memoryBuffer[2]`. – chux - Reinstate Monica Feb 09 '17 at 15:25
  • @chux If the array isn't defined with volatile, any writes to it could be optimized away. The 'system program' reading the memory won't read any change. – 2501 Feb 09 '17 at 15:28
  • @2501 The comment is true, yet for the OP that concern may not apply. IAC I see the post as insufficiently defined. We need to see all the code that accesses `memoryBuffer[2]` and the way access is blocked/locked, etc. BTW, UV on your good post; I see no reason for the DV. – chux - Reinstate Monica Feb 09 '17 at 15:32
  • @chux - it is just readings – Dennis Kirkegaard Feb 10 '17 at 10:35

5 Answers

6

The problem without volatile

Let's assume that volatile is omitted from the data array. Then the C compiler and the CPU do not know that its elements change outside the program flow. Some things that could happen then:

  • The whole array might be loaded into the cache when myTask() is called for the first time. The array might stay in the cache forever and never be updated from the "main" memory again. This issue is more pressing on multi-core CPUs if myTask() is bound to a single core, for example.

  • If myTask() is inlined into the parent function, the compiler might decide to hoist loads outside of the loop even to a point where the DMA transfer has not been completed.

  • The compiler might even be able to determine that nothing ever writes to memoryBuffer and assume that the array elements stay 0 all the time (which would again trigger a lot of optimizations); see the sketch below. This could happen if the program is rather small and all the code is visible to the compiler at once (or LTO is used). Remember: after all, the compiler does not know anything about the DMA peripheral and that it is writing "unexpectedly and wildly into memory" (from a compiler's perspective).

If the compiler is dumb/conservative and the CPU not very sophisticated (single core, no out-of-order execution), the code might even work without the volatile declaration. But it also might not...
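
To make the last bullet concrete, here is a minimal sketch (the wait loop is hypothetical, not from the question) of code the compiler may legally break once volatile is dropped:

#include <stdint.h>

uint16_t memoryBuffer[2][10][20]; // volatile omitted

// Hypothetical: spin until the DMA has written a nonzero marker.
// Since nothing in the visible program writes to memoryBuffer, the
// compiler may load memoryBuffer[0][0][0] once, see 0, and turn this
// into an endless loop (or assume the loop never runs at all).
void waitForMarker(void)
{
    while (memoryBuffer[0][0][0] == 0)
    {
        // busy-wait
    }
}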

The problem with volatile

Making the whole array volatile is often a pessimisation. For speed reasons you probably want to unroll the loop. So instead of loading from the array and incrementing the index alternately, such as

load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;
load memoryBuffer[m]
m += 1;

it can be faster to load multiple elements at once and increment the index in larger steps such as

load memoryBuffer[m]
load memoryBuffer[m + 1]
load memoryBuffer[m + 2]
load memoryBuffer[m + 3]
m += 4;

This is especially true if the loads can be fused together (e.g. to perform one 32-bit load instead of two 16-bit loads). Further, you want the compiler to use SIMD instructions to process multiple array elements with a single instruction.

These optimizations are often prevented if the load happens from volatile memory because compilers are usually very conservative with load/store reordering around volatile memory accesses. Again the behavior differs between compiler vendors (e.g. MSVC vs GCC).
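
As a rough illustration of the load fusing (a sketch of my own, not the answer's code): with a plain pointer the optimizer may combine two adjacent 16-bit loads into one 32-bit load, which it must not do for a volatile uint16_t source.

#include <stdint.h>
#include <string.h>

// Reads two adjacent uint16_t elements (p must point to at least two).
// After optimization this typically compiles to a single 32-bit LDR;
// if p pointed to volatile uint16_t, the compiler would have to emit
// two separate in-order LDRH instructions instead.
uint32_t loadTwoElements(const uint16_t *p)
{
    uint32_t fused;
    memcpy(&fused, p, sizeof fused); // well-defined, alias-safe 32-bit load
    return fused;
}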

Possible solution 1: fences

So you would like to make the array non-volatile but add a hint for the compiler/CPU saying "when you see this line (execute this statement), flush the cache and reload the array from memory". In C11 you could insert an atomic_thread_fence at the beginning of myTask(). Such fences prevent the re-ordering of loads/stores across them.

Since we do not have a C11 compiler, we use intrinsics for this task. The ARMCC compiler has a __dmb() intrinsic (data memory barrier). For GCC you may want to look at __sync_synchronize().
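
A minimal sketch of this approach, assuming the buffer and foo() from the question (pick the one barrier variant your toolchain provides):

#include <stdint.h>

void foo(uint16_t value); // from the question, defined elsewhere

uint16_t memoryBuffer[2][10][20]; // note: volatile removed

void myTask(uint8_t indexOppositeOfDMA)
{
    // Barrier at function entry: the compiler must not hoist the loads
    // above this point or reuse values cached from a previous call.
    __sync_synchronize();   // GCC legacy builtin
    // C11:   atomic_thread_fence(memory_order_acquire);
    // CMSIS: __DMB();

    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            foo(memoryBuffer[indexOppositeOfDMA][n][m]);
}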

Possible solution 2: atomic variable holding the buffer state

We use the following pattern a lot in our codebase (e.g. when reading data from SPI via DMA and calling a function to analyze it): The buffer is declared as a plain array (no volatile) and an atomic flag is added to each buffer, which is set when the DMA transfer has finished. The code looks something like this:

typedef struct Buffer
{
    uint16_t data[10][20];
    // Flag indicating if the buffer has been filled. Only use atomic instructions on it!
    int filled;
    // C11: atomic_int filled;
    // C++: std::atomic_bool filled{false};
} Buffer_t;

Buffer_t buffers[2];

Buffer_t* volatile currentDmaBuffer; // using volatile here because I'm lazy

void setupDMA(void)
{
    for (int i = 0; i < 2; ++i)
    {
        int bufferFilled;
        // Atomically load the flag.
        bufferFilled = __sync_fetch_and_or(&buffers[i].filled, 0);
        // C11: bufferFilled = atomic_load(&buffers[i].filled);
        // C++: bufferFilled = buffers[i].filled;

        if (!bufferFilled)
        {
            currentDmaBuffer = &buffers[i];
            // ... configure DMA to write to buffers[i].data and start it
            return;
        }
    }

    // If you end up here, there is no free buffer available because the
    // data processing takes too long.
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    // Atomically set the flag indicating that the buffer has been filled.
    __sync_fetch_and_or(&currentDmaBuffer->filled, 1);
    // C11: atomic_store(&currentDmaBuffer->filled, 1);
    // C++: currentDmaBuffer->filled = true;

    currentDmaBuffer = 0;
    // ... possibly start another DMA transfer ...
}

void myTask(Buffer_t* buffer)
{
    for (uint8_t n=0; n<10; n++)
        for (uint8_t m=0; m<20; m++)
            foo(buffer->data[n][m]);

    // Reset the flag atomically.
    __sync_fetch_and_and(&buffer->filled, 0);
    // C11: atomic_store(&buffer->filled, 0);
    // C++: buffer->filled = false;
}

void waitForData(void)
{
    // ... see setupDMA(void) ...
}

The advantage of pairing the buffers with an atomic flag is that you are able to detect when the processing is too slow, meaning that you have to add more buffers, slow down the incoming data, or speed up the processing code, whichever is sufficient in your case.

Possible solution 3: OS support

If you have an (embedded) OS, you might resort to other patterns instead of using volatile arrays. The OS we use features memory pools and queues. The latter can be filled from a thread or an interrupt and a thread can block on the queue until it is non-empty. The pattern looks a bit like this:

MemoryPool pool;              // A pool to acquire DMA buffers.
Queue bufferQueue;            // A queue for pointers to buffers filled by the DMA.
void* volatile currentBuffer; // The buffer currently filled by the DMA.

void setupDMA(void)
{
    currentBuffer = MemoryPool_Allocate(&pool, 20 * 10 * sizeof(uint16_t));
    // ... make the DMA write to currentBuffer
}

void DMA_done_IRQHandler(void)
{
    // ... stop DMA if needed

    Queue_Post(&bufferQueue, currentBuffer);
    currentBuffer = 0;
}

void myTask(void)
{
    void* buffer = Queue_Wait(&bufferQueue);
    // ... work with buffer ...
    MemoryPool_Deallocate(&pool, buffer);
}

This is probably the easiest approach to implement but only if you have an OS and if portability is not an issue.

Mehrwolf
  • @Mehrwolf - Your answer is very detailed :-). However, you hit the target with the data memory barrier. I'm using the CMSIS library on a Cortex-M4 and call __DMB(). The true figures in my case: leaving out the volatile declaration reduces core utilization from 17.1% to 14.3%. In my type of application that matters! Thank you :-) – Dennis Kirkegaard Feb 10 '17 at 07:41
2

Here you say that the buffer is non-volatile:

"memoryBuffer is non-volatile inside the scope of myTask"

But here you say that it must be volatile:

"but may be changed next time i call myTask"

These two sentences are contradictory. Clearly the memory area must be volatile, or the compiler can't know that it may be updated by the DMA.

However, I rather suspect that the actual performance loss comes from accessing this memory region repeatedly through your algorithm, forcing the compiler to read it back over and over again.

What you should do is take a local, non-volatile copy of the part of the memory you are interested in:

void myTask(uint8_t indexOppositeOfDMA)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      volatile uint16_t* data = &memoryBuffer[indexOppositeOfDMA][n][m];
      uint16_t local_copy = *data; // this access is volatile and won't get optimized away

      foo(&local_copy); // optimizations possible here

      // if needed, write back again:
      *data = local_copy; // optional
    }
  }
}

You'll have to benchmark it, but I'm pretty sure this should improve performance.

Alternatively, you could first copy the whole part of the array you are interested in, then work on that, before writing it back. That should help performance even more.
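
A sketch of that variant, assuming the buffer from the question and read-only processing:

void myTask(uint8_t indexOppositeOfDMA)
{
  uint16_t local[10][20]; // plain, non-volatile working copy

  // Copy element by element; each read is a proper volatile access.
  // memcpy() cannot be used here, as it would discard the volatile
  // qualifier.
  for (uint8_t n = 0; n < 10; n++)
    for (uint8_t m = 0; m < 20; m++)
      local[n][m] = memoryBuffer[indexOppositeOfDMA][n][m];

  // The copy is ordinary memory: the compiler may unroll, fuse loads
  // and vectorize freely from here on.
  for (uint8_t n = 0; n < 10; n++)
    for (uint8_t m = 0; m < 20; m++)
      foo(local[n][m]);
}

The volatile cost is paid once in the copy loop; everything after it runs on ordinary memory.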

Lundin
  • If your suspicion is right, the whole post, question and answers, is a waste of time... ;) – LPs Feb 09 '17 at 15:52
1

You're not allowed to cast away the volatile qualifier.¹

If the array must be defined with volatile elements, then the only two options that let the compiler know that the memory has changed are to keep the volatile qualifier, or to use a temporary array which is defined without volatile and is copied to the proper array after the function call. Pick whichever is faster.
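
For the writing direction, a sketch of the temporary-array option (fill() is a hypothetical producer, not from the question):

#include <stdint.h>

volatile uint16_t memoryBuffer[2][10][20];

void fill(uint16_t out[10][20]); // hypothetical: computes the new data

void produce(uint8_t idx)
{
    uint16_t tmp[10][20]; // plain array: full optimization allowed here

    fill(tmp); // do all the work on the non-volatile temporary

    // Copy back element by element; each store is a proper volatile
    // access, so none of them can be optimized away.
    for (uint8_t n = 0; n < 10; n++)
        for (uint8_t m = 0; m < 20; m++)
            memoryBuffer[idx][n][m] = tmp[n][m];
}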


¹ Quoted from ISO/IEC 9899:201x, 6.7.3 Type qualifiers, §6:
If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined.

2501
  • Also, in the rules of pointer conversions (6.3.2.3), conversion from pointer-to-qualified-type to pointer-to-type is not listed as a valid form of conversion. – Lundin Feb 09 '17 at 15:47
  • @Lundin Hmm, yes. The rule in my answer uses the term refer, not access, so (depending on what refer means) simply pointing, converting, and assigning the pointer would be undefined. – 2501 Feb 09 '17 at 16:01
0

It seems to me that you are passing half of the buffer to myTask, and each half does not need to be volatile. So I wonder if you could solve your issue by defining the buffer as such, and then passing a pointer to one of the half-buffers to myTask. I'm not sure whether this will work, but maybe something like this...

typedef struct memory_buffer {
    uint16_t buffer[10][20];
} memory_buffer ;

volatile memory_buffer double_buffer[2];

void myTask(memory_buffer *mem_buf)
{
  for(uint8_t n=0; n<10; n++)
  {
    for(uint8_t m=0; m<20; m++)
    {
      //Do some stuff with memory:
      foo(mem_buf->buffer[n][m]);
    }
  }
}
kkrambo
  • The call of `myTask()` will need to cast away the `volatile`, which may be safe if the higher level code "knows" half of `double_buffer[]` is non-volatile during the function call. – chux - Reinstate Monica Feb 09 '17 at 15:21
  • This won't work; pointer conversions from pointer-to-volatile-type to pointer-to-type aren't allowed by C. And splitting the buffer in two solves nothing. – Lundin Feb 09 '17 at 15:45
0

I don't know your platform/MCU/SoC, but usually DMAs have interrupts that trigger on a programmable threshold.

What I can imagine is to remove the volatile keyword and use the interrupt as a semaphore for the task.

In other words:

  • The DMA is programmed to interrupt when the last byte of the buffer is written
  • The task blocks on a semaphore/flag, waiting for the flag to be released
  • When the DMA triggers the interrupt routine, it changes the buffer pointed to by the DMA for the next transfer and changes the flag, which unblocks the task so it can process the data.

Something like:

uint16_t memoryBuffer[2][10][20];

volatile uint8_t PingPong = 0;

void interrupt(void)
{
    // Change current DMA pointed buffer

    PingPong ^= 1;    
}

void myTask(void)
{
    static uint8_t lastPingPong = 0;

    if (lastPingPong != PingPong)
    {
        for (uint8_t n = 0; n < 10; n++)
        {
            for (uint8_t m = 0; m < 20; m++)
            {
                //Do some stuff with memory:
                foo(memoryBuffer[PingPong][n][m]);
            }
        }

        lastPingPong = PingPong;
    }
}
LPs
  • Toggling `PingPong` can easily miss an interrupt. I would simply increment it instead. Even better would be to use a proper semaphore or condition variable, of course. – Mehrwolf Feb 09 '17 at 15:22
  • @Mehrwolf Of course a system semaphore is the best choice, but as I wrote, the platform is not specified. BTW, if the task is called at least at double the rate of the DMA read time, you can be sure that the task triggers on each ping-pong. – LPs Feb 09 '17 at 15:25