I have been searching high and low for a way to allocate DMA buffers that hardware can write to but that are cacheable for reads from the CPU. I started by using the Linux kernel command line option mem=896M to reserve the last 128 MB of RAM for DMA buffers (yes, this is excessive). Then, in a driver I wrote, I do the following:
void *srcBuf = NULL;
void *dstBuf = NULL;
dma_declare_coherent_memory(&gDev, BUFFER_ADDR, BUFFER_ADDR, 128*1024*1024, DMA_MEMORY_MAP);
srcBuf = dma_alloc_coherent(&gDev, 10*1024*1024, &dmaSrcAddr, GFP_DMA);
dstBuf = dma_alloc_coherent(&gDev, 10*1024*1024, &dmaDstAddr, GFP_DMA);
This correctly sets aside 128 MB at BUFFER_ADDR (the end of physical RAM) and then gets two 10 MB buffers from that area. I then run some simple memset and memcpy code to test bandwidth:
start = ktime_get();
memset(srcBuf, 0x55, BUFFER_SIZE);
stop = ktime_get();
printPerformance(start.tv64, stop.tv64, "Write performance to memory");
start = ktime_get();
memcpy(dstBuf, srcBuf, BUFFER_SIZE);
stop = ktime_get();
printPerformance(start.tv64, stop.tv64, "Copy performance from src to dst");
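For reference, here is a minimal sketch of the kind of arithmetic a helper like printPerformance would need to turn two nanosecond timestamps into MB/sec. The function name throughput_mb_per_sec is hypothetical and is not the helper from the driver above; it also assumes 1 MB = 1,000,000 bytes.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the original printPerformance().
 * Converts a byte count and two nanosecond timestamps into MB/sec,
 * assuming 1 MB = 1,000,000 bytes. Integer-only on purpose; in a 32-bit
 * kernel build the 64-bit division would likely need div64_u64().
 */
uint64_t throughput_mb_per_sec(uint64_t start_ns, uint64_t stop_ns, uint64_t bytes)
{
    uint64_t elapsed_ns;

    if (stop_ns <= start_ns)
        return 0; /* no measurable interval */

    elapsed_ns = stop_ns - start_ns;
    /* (bytes / 1e6) / (ns / 1e9) simplifies to bytes * 1000 / ns */
    return (bytes * 1000u) / elapsed_ns;
}
```

The result is truncated toward zero, so very short intervals lose precision.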
This yields a terrible 919 MB/sec for memset and 125 MB/sec for memcpy. This design is similar to a Zedboard: I have two 16-bit 533 MHz DDR3L parts, so I should have a bandwidth of 533*2*4 = 4264 MB/sec.
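The peak figure can be written out as a small calculation. This is only a sketch: it reads the factor of 2 as the double data rate and the factor of 4 as the byte width of a 32-bit bus formed by the two 16-bit parts, which is my interpretation of the numbers. The function names are hypothetical.

```c
#include <stdint.h>

/* Theoretical peak in MB/sec: clock (MHz) * 2 transfers per clock * bus width in bytes. */
uint32_t ddr_peak_mb_per_sec(uint32_t clock_mhz, uint32_t bus_width_bits)
{
    return clock_mhz * 2u * (bus_width_bits / 8u);
}

/* Measured throughput as a whole-number percentage of the peak (truncated). */
uint32_t percent_of_peak(uint32_t measured_mb_per_sec, uint32_t peak_mb_per_sec)
{
    if (peak_mb_per_sec == 0)
        return 0;
    return (measured_mb_per_sec * 100u) / peak_mb_per_sec;
}
```

With these definitions, ddr_peak_mb_per_sec(533, 32) gives the 4264 MB/sec figure, and the kernel memset result of 919 MB/sec comes out at roughly a fifth of that.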
I then do the same thing from user space, but use posix_memalign. This gives me 2461 MB/sec for memset and 298 MB/sec for memcpy. Better, but still terrible.
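A user-space test along these lines could look roughly like the sketch below. It is not the exact code I ran: the function name bench_user_buffers, the 4096-byte alignment, and the use of clock_gettime with CLOCK_MONOTONIC are all assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical benchmark: times a memset of src and a memcpy from src to dst.
 * Returns 0 on success, -1 on allocation failure or if the copy did not verify.
 * The buffers are not pre-faulted, so first-touch page faults are included
 * in the measured times.
 */
int bench_user_buffers(size_t size, uint64_t *memset_ns, uint64_t *memcpy_ns)
{
    void *src = NULL, *dst = NULL;
    uint64_t t0, t1, t2;
    int ok;

    if (size == 0)
        return -1;
    if (posix_memalign(&src, 4096, size) != 0)
        return -1;
    if (posix_memalign(&dst, 4096, size) != 0) {
        free(src);
        return -1;
    }

    t0 = now_ns();
    memset(src, 0x55, size);
    t1 = now_ns();
    memcpy(dst, src, size);
    t2 = now_ns();

    *memset_ns = t1 - t0;
    *memcpy_ns = t2 - t1;

    /* Reading the result back keeps the compiler from discarding the work. */
    ok = ((unsigned char *)dst)[size - 1] == 0x55;

    free(src);
    free(dst);
    return ok ? 0 : -1;
}
```

Dividing the buffer size by each elapsed time gives the MB/sec numbers.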
Finally, I wrote a bare-metal app to do the same thing. I didn't allocate memory there; with no MMU running, I just picked two addresses and performed my memset/memcpy tests on them. I ran the bare-metal test in three configurations: both L1 and L2 caches enabled, L1 enabled but L2 disabled, and both caches disabled. This results in the following:

Both enabled: memset 5296 MB/sec, memcpy 637 MB/sec
L2 disabled: memset 1426 MB/sec, memcpy 834 MB/sec
Both disabled: memset 426 MB/sec, memcpy 276 MB/sec
The performance of the userspace code is about the same as bare metal with both caches disabled. The performance of the kernel code is about half of that. Does anyone have any ideas?
Here is some of the info I've been looking at:
https://lwn.net/Articles/440221/
https://aelseb.wordpress.com/2015/04/11/contiguous-memory-on-arm-and-cache-coherency/
https://lkml.org/lkml/2015/5/20/715
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-September/197780.html