I am having trouble implementing a NEON version of memcpy(dst, src, sz).

The DMA buffer on my ARM SoC is mapped as uncached memory, so copying out of it is very slow.

void my_copy(volatile unsigned char *dst, volatile unsigned char *src, int sz)
{
    if (sz & 63) {
        sz = (sz & -64) + 64;
    }
    asm volatile (
        "NEONCopyPLD:                          \n"
        "    VLDM %[src]!,{d0-d7}                 \n"
        "    VSTM %[dst]!,{d0-d7}                 \n"
        "    SUBS %[sz],%[sz],#0x40                 \n"
        "    BGT NEONCopyPLD                  \n"
        : [dst]"+r"(dst), [src]"+r"(src), [sz]"+r"(sz) : : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "cc", "memory");
}

This is code for ARMv7 by @Timothy Miller, from the question "ARM/neon memcpy optimized for *uncached* memory?".

Since the ARM64 instruction set has no VLDM or VSTM, I am using LD and ST instead. However, the result is as slow as memcpy() in C.

"NEONCopyPLD: \n"
"ld4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[src]], #64 \n"
"st4 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[dst]], #64 \n"
"SUBS %[sz], %[sz],#0x40\n"
"BGT NEONCopyPLD \n"

Is there a better way than using LD and ST on ARM64?

Peter Cordes
Sung
  • That is an ARMv7 not an ARM7 but anyway... – old_timer Apr 14 '20 at 14:51
  • Note that you shouldn't use plain old labels in inline assembly as the inline assembly snippet might be instantiated more than once in your program. Instead, use `%=` to make the label names unique. Also consider prefixing your label names with `.L` so they don't appear in the symbol table and don't confuse any debug tools. – fuz Apr 14 '20 at 14:58
  • There is absolutely no reason to use `ld4` and `st4`; replace them with `ld1` and `st1`. Move `subs` one line up. The branch should be `b.gt`. – Jake 'Alquimista' LEE Apr 16 '20 at 10:25

1 Answer

AArch64 has non-temporal load/store pair instructions (`ldnp`/`stnp`), which are meant for uncached areas.

Below is what I suggest:

"sub %[dst], %[dst], #64 \n" /* pre-bias dst: the loop adds 64 before each store */
"1: \n"
"ldnp q0, q1, [%[src]] \n"
"ldnp q2, q3, [%[src], #32] \n"
"add %[dst], %[dst], #64 \n"
"subs %[sz], %[sz], #64 \n"
"add %[src], %[src], #64 \n"
"stnp q0, q1, [%[dst]] \n"
"stnp q2, q3, [%[dst], #32] \n"
"b.gt 1b \n"
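
For reference, here is a minimal sketch of how the loop above could be wrapped into a complete function. The name `copy_nontemporal`, the `size_t` size parameter, the `assert` on the size, and the `memcpy` fallback for non-AArch64 builds are my own assumptions, not part of the original code. The label uses `.L` and `%=` as suggested in the comments, so the snippet can be inlined more than once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copies sz bytes, where sz must be a positive
 * multiple of 64 (the loop always moves whole 64-byte blocks). */
static void copy_nontemporal(void *dst, const void *src, size_t sz)
{
    assert(sz > 0 && (sz & 63) == 0);
#if defined(__aarch64__)
    unsigned char *d = dst;
    const unsigned char *s = src;
    __asm__ volatile (
        "sub  %[dst], %[dst], #64    \n" /* pre-bias dst */
        ".Lnt_copy%=:                \n" /* unique local label */
        "ldnp q0, q1, [%[src]]       \n"
        "ldnp q2, q3, [%[src], #32]  \n"
        "add  %[dst], %[dst], #64    \n"
        "subs %[sz], %[sz], #64      \n"
        "add  %[src], %[src], #64    \n"
        "stnp q0, q1, [%[dst]]       \n"
        "stnp q2, q3, [%[dst], #32]  \n"
        "b.gt .Lnt_copy%=            \n"
        : [dst] "+r"(d), [src] "+r"(s), [sz] "+r"(sz)
        :
        : "v0", "v1", "v2", "v3", "cc", "memory");
#else
    /* Assumption: plain memcpy stands in on other architectures so the
     * sketch still builds and behaves the same. */
    memcpy(dst, src, sz);
#endif
}
```

A call would look like `copy_nontemporal(dst, src, 256);`. Declaring `sz` as `size_t` makes `%[sz]` a full 64-bit register, so `subs` never sees stale upper bits.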

For a cached area:

"sub %[src], %[src], #32 \n" /* pre-bias both pointers for the */
"sub %[dst], %[dst], #32 \n" /* pre-indexed writeback below     */
"1: \n"
"ldp q0, q1, [%[src], #32] \n"
"ldp q2, q3, [%[src], #64]! \n"
"subs %[sz], %[sz], #64 \n"
"stp q0, q1, [%[dst], #32] \n"
"stp q2, q3, [%[dst], #64]! \n"
"b.gt 1b \n"
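
The question's code rounds `sz` up to a multiple of 64, which reads and writes past the end of the buffers when the size is not already a multiple. As a rough sketch of one alternative, the cached loop above can handle only the whole 64-byte blocks and leave the remainder to `memcpy`. The name `copy_cached`, the tail handling, and the non-AArch64 fallback are assumptions of mine for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: copies exactly sz bytes without over-copying. */
static void copy_cached(void *dst, const void *src, size_t sz)
{
    assert(dst != NULL && src != NULL);
    size_t bulk = sz & ~(size_t)63;  /* bytes in whole 64-byte blocks */
#if defined(__aarch64__)
    if (bulk != 0) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t n = bulk;
        __asm__ volatile (
            "sub  %[src], %[src], #32    \n"
            "sub  %[dst], %[dst], #32    \n"
            ".Lc_copy%=:                 \n" /* unique local label */
            "ldp  q0, q1, [%[src], #32]  \n"
            "ldp  q2, q3, [%[src], #64]! \n" /* writeback: src += 64 */
            "subs %[n], %[n], #64        \n"
            "stp  q0, q1, [%[dst], #32]  \n"
            "stp  q2, q3, [%[dst], #64]! \n" /* writeback: dst += 64 */
            "b.gt .Lc_copy%=             \n"
            : [dst] "+r"(d), [src] "+r"(s), [n] "+r"(n)
            :
            : "v0", "v1", "v2", "v3", "cc", "memory");
    }
#else
    /* Assumption: plain memcpy stands in on other architectures. */
    memcpy(dst, src, bulk);
#endif
    /* Copy the remaining 0..63 bytes, if any. */
    memcpy((unsigned char *)dst + bulk,
           (const unsigned char *)src + bulk, sz - bulk);
}
```

With this shape, `copy_cached(dst, src, 200)` moves three blocks in the loop and the last 8 bytes through `memcpy`, so nothing beyond byte 199 is touched.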
Jake 'Alquimista' LEE