
The documentation on the streaming DMA API mentions that, in order to ensure consistency, the cache needs to be flushed before a buffer is DMA-mapped to the device, and invalidated after it is unmapped from the device.

However, I am confused about whether the flush and invalidate need to be performed explicitly. That is, do dma_map_single() and dma_sync_single_for_device() already take care of flushing the cachelines, or does the driver developer need to call some function to explicitly flush the cachelines of the DMA buffer? The same goes for dma_unmap_single() and dma_sync_single_for_cpu(): do these two functions automatically invalidate the DMA buffer's cachelines?

I skimmed through some existing drivers that use streaming DMA, and I can't see any explicit calls to flush or invalidate the cachelines.
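For reference, the pattern I see in those drivers looks roughly like this (a simplified sketch of a transmit path; dev, buf and len are placeholders, not taken from any specific driver):

    dma_addr_t dma_handle;

    /* Map the buffer for DMA; no explicit cache maintenance anywhere. */
    dma_handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, dma_handle))
        return -ENOMEM;

    /* ... hand dma_handle to the device and wait for the transfer ... */

    dma_unmap_single(dev, dma_handle, len, DMA_TO_DEVICE);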

I also went through the kernel source code, and it seems that the above-mentioned functions all 'invalidate' the cachelines in their architecture-specific implementations, which further adds to my confusion. For example, in arch/arm64/mm/cache.S:

SYM_FUNC_START_PI(__dma_map_area)
    add x1, x0, x1
    cmp w2, #DMA_FROM_DEVICE
    b.eq    __dma_inv_area
    b   __dma_clean_area
SYM_FUNC_END_PI(__dma_map_area)

Can someone please clarify this? Thanks.

  • Mapping functions will flush caches on the CPU side. In case you update the buffer in between, you need to sync it to the device, or if the device has new data coming, you need to sync it to the CPU. You may avoid all of this by using a DMA coherent area. – 0andriy Oct 27 '21 at 21:48
  • So.. I don't need to call functions like dma_cache_inv() or dma_cache_wb() to ensure consistency, just the map()/unmap() or sync operations. Thanks for clarifying. I am constrained to use streaming DMA in an attempt to improve performance. – szs Oct 29 '21 at 06:34
  • 1
    When you *map* the area, no need to flush caches, when you *re-use* that memory, you have to be sure that data is actual, which is done by DMA sync API calls. They will flush caches if needed, but in some platforms you might need an additional work. It’s all architecture dependent. – 0andriy Oct 29 '21 at 10:50

1 Answer


So, based on the comments received and some more findings, I thought I would answer this question myself for others with similar queries. The following is specific to the ARM64 architecture; other architectures may have slightly different implementations.

When using the streaming DMA API, one does NOT have to explicitly flush or invalidate the cachelines. The functions dma_map_single(), dma_sync_single_for_device(), dma_unmap_single() and dma_sync_single_for_cpu() take care of that for you. For example, dma_map_single() and dma_sync_single_for_device() both end up calling the architecture-dependent function __dma_map_area:

ENTRY(__dma_map_area)
    cmp w2, #DMA_FROM_DEVICE
    b.eq    __dma_inv_area
    b   __dma_clean_area
ENDPIPROC(__dma_map_area)

In this case, if the direction specified is DMA_FROM_DEVICE, the cachelines are invalidated (the device is about to write new data to memory, so the cachelines must be invalidated so that any subsequent CPU read fetches the new data from memory). If the direction is DMA_TO_DEVICE or DMA_BIDIRECTIONAL, a flush operation is performed (the CPU may have written data that is still sitting in the cache, so it must be flushed to memory for the device to see valid data). NOTE that 'clean' in __dma_clean_area is ARM's nomenclature for a cache flush.
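In other words, the logic boils down to something like this (a conceptual C restatement of the arm64 assembly above, not actual kernel code; invalidate() and clean() are hypothetical helpers):

    /* Conceptual view of __dma_map_area on arm64 (hypothetical helpers). */
    if (dir == DMA_FROM_DEVICE)
        invalidate(addr, size); /* drop stale cachelines before the device writes */
    else
        clean(addr, size);      /* write dirty cachelines back to memory */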

The same goes for dma_unmap_single() and dma_sync_single_for_cpu(), which end up calling __dma_unmap_area(). It invalidates the cachelines if the direction specified is not DMA_TO_DEVICE:

ENTRY(__dma_unmap_area)
    cmp w2, #DMA_TO_DEVICE
    b.ne    __dma_inv_area
    ret
ENDPIPROC(__dma_unmap_area)

dma_map_single() and dma_unmap_single() are expensive operations, since they also involve additional page mapping/unmapping work. So if the direction is to remain constant and the buffer is reused, it is better to keep the buffer mapped and use dma_sync_single_for_cpu() and dma_sync_single_for_device() instead, as sketched below.
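As an illustration, a receive buffer that is mapped once and reused could look roughly like this (a sketch under the assumptions above; dev, buf, len, more_data and process() are hypothetical placeholders):

    dma_addr_t dma_handle;

    dma_handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dma_handle))
        return -ENOMEM;

    while (more_data) {
        /* Hand the buffer to the device; invalidates stale cachelines. */
        dma_sync_single_for_device(dev, dma_handle, len, DMA_FROM_DEVICE);

        /* ... device DMAs new data into the buffer ... */

        /* Take the buffer back before the CPU reads it. */
        dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);
        process(buf);
    }

    dma_unmap_single(dev, dma_handle, len, DMA_FROM_DEVICE);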

On a side note, in my case, using streaming DMA resulted in ~10x faster read operations compared to coherent DMA. However, the code gets a little more complicated, because you need to ensure that the memory is not accessed by the CPU while the buffer is mapped to the device (or that the sync operations are called before/after CPU access).
