
I want to copy an image on an ARMv7 core. The naive implementation is to call memcpy per line.

for(i = 0; i < h; i++) {
  memcpy(d, s, w);
  s += sp;
  d += dp;
}

I know that the following

d, dp, s, sp, w

are all 32-byte aligned, so my next (still quite naive) implementation was along the lines of

for (int i = 0; i < h; i++) {
  uint8_t* dst = d;
  const uint8_t* src = s;
  int remaining = w;
  asm volatile (
    "1:                                               \n"
    "subs     %[rem], %[rem], #32                     \n"
    "vld1.u8  {d0, d1, d2, d3}, [%[src],:256]!        \n"
    "vst1.u8  {d0, d1, d2, d3}, [%[dst],:256]!        \n"
    "bgt      1b                                      \n"
    : [dst]"+r"(dst), [src]"+r"(src), [rem]"+r"(remaining)
    :
    : "d0", "d1", "d2", "d3", "cc", "memory"
  );
  d += dp;
  s += sp;
}

This was ~150% faster than memcpy over a large number of iterations (on different images, so not taking advantage of caching). I feel this should be nowhere near the optimum because I have yet to use preloading, but when I do add it I only seem to be able to make performance substantially worse. Does anyone have any insight here?

robbie_c
  • Try unrolling the loop by at least 2X. NEON loads are not instantaneous due to pipelining and memory speed. If you do 2 loads followed by 2 stores, you should see a benefit. The cache preload can definitely speed things up, but the read-ahead distance needs to be tuned to your target platform. – BitBank Jun 22 '12 at 17:29
  • I tried that but the difference was negligible. I followed the same reasoning but bear in mind that those loads and stores are only 2 cycles each ([source](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/ch16s06s07.html)). Cache line size is 64 bytes, I tried prefetching 64, 128, 192 and 256 bytes ahead, all of which made this considerably slower (2-3 times). – robbie_c Jun 22 '12 at 17:59
  • Have you tried looking at memcpy source? Maybe it is already optimized and uses NEON instructions on your platform. – Mārtiņš Možeiko Jun 22 '12 at 19:41
  • Prefetching is notoriously difficult to get right and rarely helpful. For memcpy you have no computation cycles to speak of so there probably isn't anything to be gained from prefetching. – Paul R Jun 22 '12 at 20:07
  • Have you thought about using the DMA? I don't know how much faster/slower the copy would be, but you could be doing other processing, so your overall app speed may improve? – Josh Petitt Jun 26 '12 at 01:42
  • You can get a huge speed-up in the common case where `w` = `dp` = `sp` by detecting that case and doing a single memcpy there. (Or, with a custom line copy algorithm, running that once instead of per-line). – Dan Hulme Jun 28 '12 at 01:09
  • @JoshPetitt The code has to be run on an iOS device, I don't think I can access a DMA? – robbie_c Jul 20 '12 at 10:39
  • @DanHulme In my use case this never happens. The source is a video decoder's reference frame, which are always padded, so sp != w. The destination is packed such that dp = w, however I don't think this alone gains anything? – robbie_c Jul 20 '12 at 10:39
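For readers whose layout does match the common case Dan Hulme describes (`w` = `dp` = `sp`), a minimal sketch of that fast path, using a hypothetical `copy_image` helper (the name and signature are illustrative, not from the question):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy h rows of w bytes each. When both strides
   equal the row width, the image is one contiguous block, so a single
   large memcpy replaces the per-line loop. */
static void copy_image(uint8_t *d, size_t dp,
                       const uint8_t *s, size_t sp,
                       size_t w, size_t h)
{
    if (dp == w && sp == w) {
        memcpy(d, s, w * h);         /* contiguous: one large copy */
        return;
    }
    for (size_t i = 0; i < h; i++) { /* strided: fall back to per-line */
        memcpy(d, s, w);
        s += sp;
        d += dp;
    }
}
```

The win comes from amortizing the per-call overhead and letting one memcpy stream the whole image; as robbie_c notes above, it only applies when the source is not padded.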

1 Answer


ARM has a great tech note on this.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

Your performance will definitely vary with the micro-architecture. ARM's note covers the Cortex-A8, but it should still give you a decent idea, and the summary at the bottom is a good discussion of the various pros and cons that go beyond the raw numbers, such as which methods use the fewest registers.

And yes, as another commenter mentioned, prefetching is very difficult to get right. It behaves differently on different micro-architectures depending on cache size, line size, and other details of the cache design, and you can end up thrashing lines you still need if you aren't careful. I would recommend avoiding it in portable code.
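If you do experiment anyway, a compiler-portable way to express the prefetch (rather than a raw `pld`) is GCC/Clang's `__builtin_prefetch`. This is only a sketch: the distance `PF_DIST` is an illustrative guess and, as noted above, a bad value can easily make the copy slower rather than faster.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative only: copy one line, issuing a software prefetch a fixed
   distance ahead of the read pointer. PF_DIST must be tuned per core. */
#define PF_DIST 192  /* bytes ahead; a guess, not a recommendation */

static void copy_line_prefetch(uint8_t *dst, const uint8_t *src, size_t w)
{
    for (size_t i = 0; i < w; i += 32) {
        /* args: address, 0 = prefetch for read, 0 = low temporal locality */
        __builtin_prefetch(src + i + PF_DIST, 0, 0);
        memcpy(dst + i, src + i, 32);  /* compiler lowers to wide loads */
    }
}
```

Note that near the end of the buffer the prefetch address runs past the source; prefetch hints do not fault, so this is tolerated in practice, but it is something to keep in mind when adapting the sketch.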

Peter M