
I'm using a Xilinx Zynq 7000 ARM-based SoC. I'm struggling with DMA buffers (see my other question: Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)), so one thing I pursued was a faster memcpy.

I've been looking at writing a faster memcpy for ARM using Neon instructions and inline asm. Whatever glibc has, it's terrible, especially if we're copying from an uncached DMA buffer.

I've put together my own copy function from various sources.

The main difference for me is that I'm trying to copy from an uncached buffer because it's a DMA buffer, and ARM support for cached DMA buffers is nonexistent.

So here's what I wrote:

void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    /* Round sz up to a multiple of 64 so the loop always moves whole
       64-byte chunks (the copy may run a little past sz). */
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                          \n"
        "    VLDM %[src]!,{d0-d7}              \n" /* load 64 bytes, post-incrementing src */
        "    VSTM %[dst]!,{d0-d7}              \n" /* store 64 bytes, post-incrementing dst */
        "    SUBS %[sz],%[sz],#0x40            \n" /* 64 bytes done */
        "    BGT NEONCopyPLD                   \n" /* loop while bytes remain */
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}
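To make the alignment and rounding assumptions explicit, a minimal test harness might look like this (the sizes are placeholders, and in the real case src would be the mmap'ed uncached DMA buffer rather than a heap allocation):

#include <stdlib.h>
#include <string.h>

void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz);  /* as above */

int main(void)
{
    enum { SZ = 1000 };                              /* placeholder size, not a multiple of 64 */
    size_t padded = (SZ + 63) & ~63u;                /* what my_copy will actually move */
    unsigned char *src = aligned_alloc(64, padded);  /* stand-in for the uncached mapping */
    unsigned char *dst = aligned_alloc(64, padded);  /* needs slack up to the rounded size */

    if (!src || !dst)
        return 1;
    memset(src, 0xAB, padded);

    my_copy(dst, src, SZ);                           /* copies 'padded' bytes in 64-byte chunks */

    free(src);
    free(dst);
    return 0;
}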

The main thing I did was leave out the prefetch instruction because I figured it would be worthless on uncached memory.

Doing this resulted in a speedup of 4.7x over the glibc memcpy. Speed went from about 70MB/sec to about 330MB/sec.

Unfortunately, this isn't nearly as fast as a memcpy from cached memory, which runs at around 720MB/sec for the system memcpy and 620MB/sec for the Neon version (probably slower because my memcpy doesn't do prefetching).

Can anyone help me figure out what I can do to make up for this performance gap?

I've tried a number of things, like copying more data at once and doing two loads followed by two stores. I could try prefetch just to prove that it's useless (see the sketch below). Any other ideas?
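For reference, the prefetch experiment would look something like this (untested; the 192-byte prefetch distance is just a guess, and since PLD is only a hint I expect it to be ignored for uncached mappings):

void my_copy_pld(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyWithPLD:                      \n"
        "    PLD [%[src], #192]                \n" /* prefetch hint a few lines ahead */
        "    VLDM %[src]!,{d0-d7}              \n"
        "    VSTM %[dst]!,{d0-d7}              \n"
        "    SUBS %[sz],%[sz],#0x40            \n"
        "    BGT NEONCopyWithPLD               \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}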

Timothy Miller
  • Is your source a multiple of the level 1 cache line size? – David Wohlferd Jan 20 '16 at 00:10
  • I've ensured that the data buffers are aligned on 64-byte boundaries and in 64-byte units. (Technically, the end of the last 64-byte unit may get ignored.) – Timothy Miller Jan 20 '16 at 00:25
  • Is your uncached buffer located in DRAM? If so, it's likely impossible to close the gap. Cache excels at hiding memory latency in this kind of workload. If your buffer size is small enough and bandwidth is a real concern, consider moving to an on-chip memory. – Tony K Jan 20 '16 at 03:33
  • In my experience the best approach is experimenting. Maybe don't use vldm but use single load/store variants, unroll further, do the subs earlier. Also I would do a non-Neon version to see if that gets better; sometimes Neon has its own memory port, sometimes not. – auselen Jan 20 '16 at 12:19
  • @TonyK So far, the largest data block we may want to transfer is just under 32MB. The chip we're working with is a Xilinx Zynq 7000, and there just isn't enough SRAM in the FPGA fabric. The big memory is the main DRAM. – Timothy Miller Jan 20 '16 at 13:11
  • @auselen I've used Sparc and x86 far more than ARM, but in my experience in cases like this, it's important to do large (semi-)atomic transfers. I'll try it, but I'm betting that as soon as I drop below the cache line size in a single transfer, the throughput is going to plummet. – Timothy Miller Jan 20 '16 at 13:14
  • Well, if you use 8 registers, you'll transfer 32 bytes, and that's probably the L1 CPU cache line size? – auselen Jan 20 '16 at 13:45
  • @auselen I tried doing 16 registers, and the performance went up imperceptibly. I also tried doing two loads of 8 followed by two stores of 8. That improved performance by the same amount. – Timothy Miller Jan 20 '16 at 14:45
  • I would also try using regular ARM registers. – auselen Jan 20 '16 at 14:49
  • @auselen I'm away from the hardware and can't try that for a little while. I know there's a load multiple instruction. If I load a line's worth of registers, will that be an atomic operation and be a single transaction on the memory bus? The whole problem really is the round-trip latency to access RAM; without prefetching, there's no read-ahead, so the limit is latency, not bandwidth. – Timothy Miller Jan 20 '16 at 15:23
  • Here's another thought: have you tried using DMA instead of a purely software memcpy? I know this question stems from your problem with DMA in the first place, but this could help with latency since you don't have the CPU treating every read from the buffer like a cache miss, and it may help the memory controller to take advantage of the DRAM's burstiness. – Tony K Jan 20 '16 at 19:10

2 Answers


If you're trying to do large, fast transfers, cached memory will often outperform uncached memory, but as you pointed out, support for cached DMA buffer memory must be managed somewhere, and on <=ARMv7, that place is the kernel / kernel-driver.

I'm assuming two things about your design:

  • Userspace is reading a memory-mapped hardware buffer
  • There's some sort of signal/event/interrupt from the FPGA to the Cortex-A9 VIC/GIC that tells the Cortex-A9 when a new buffer is available to read.

Align your DMA buffers on cacheline boundaries and do not place anything between the end of the DMA buffer and the next cacheline. Invalidate the cache whenever the FPGA signals the CPU that a buffer is ready.
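For the invalidate step, assuming the buffer is managed by a kernel driver using the streaming DMA API, the interrupt handler might do something like this rough sketch (struct my_dev, buf_dma and BUF_SIZE are placeholders, not names from your code):

/* Sketch only: assumes the buffer was mapped with dma_map_single(). */
#include <linux/dma-mapping.h>
#include <linux/interrupt.h>

#define BUF_SIZE (32 * 1024 * 1024)   /* placeholder: your buffer size */

struct my_dev {                       /* placeholder driver state */
    struct device *dev;
    dma_addr_t     buf_dma;           /* handle returned by dma_map_single() */
};

static irqreturn_t buffer_ready_irq(int irq, void *cookie)
{
    struct my_dev *priv = cookie;

    /* Throw away any stale cachelines covering the buffer so the CPU
       (and any userspace mapping of it) reads what the FPGA just wrote. */
    dma_sync_single_for_cpu(priv->dev, priv->buf_dma, BUF_SIZE, DMA_FROM_DEVICE);

    /* ...wake up whoever is waiting on the buffer here... */
    return IRQ_HANDLED;
}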

I don't think the A9 has a mechanism to control cachelines on all cores and layers together, so you may wish to pin the program doing this to one core so that you can skip maintaining caches on the other core.
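Pinning from within the process itself is straightforward; a minimal sketch using sched_setaffinity (CPU 0 is an arbitrary choice):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to CPU 0 so only one core's caches
   ever hold lines from the buffer. */
static int pin_to_cpu0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = calling thread */
}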

rsaxvc

You can try using buffered memory rather than non-cached memory.
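For example, if the buffer is exported by a kernel driver, pgprot_writecombine() in the driver's mmap() handler gives you normal, non-cacheable (bufferable) memory on ARM instead of a strongly-ordered uncached mapping. A rough sketch, where buf_phys stands in for your buffer's physical address:

#include <linux/mm.h>

static phys_addr_t buf_phys;          /* placeholder: physical address of the reserved buffer */

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    /* Map the reserved buffer as bufferable (write-combining) memory. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    return remap_pfn_range(vma, vma->vm_start,
                           buf_phys >> PAGE_SHIFT,
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot);
}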

  • There's an ACP (Accelerator Coherency Port), which we can use to write to DRAM *through* the ARM L2 cache. Unfortunately, there were some problems with that, but it's been so long that I don't remember what exactly the trouble was. – Timothy Miller Sep 15 '16 at 01:05
  • The purpose of using buffered memory is to enable the write buffer, not only the L2 cache. Have you resolved the issue? – Horace Hsieh Sep 15 '16 at 02:31