4

A few days ago I was writing some code and noticed that copying RAM with memcpy was much, much faster than copying it in a for loop.

I have no measurements right now (maybe I'll do some later), but as I remember, the same block of RAM that a for loop copied in about 300 ms or more was copied by memcpy in 20 ms or less.

How is this possible? Is memcpy hardware accelerated?

Jakub Hampl
  • 39,863
  • 10
  • 77
  • 106
grunge fightr
  • 1,360
  • 2
  • 19
  • 38
  • Loop penalties removed? Maybe a code snippet would help to answer your question more precisely. –  Apr 24 '11 at 18:51

5 Answers

2

Well, I can't speak about Apple's compilers, but gcc definitely treats memcpy as a builtin.

C. K. Young
  • 219,335
  • 46
  • 382
  • 435
2

The built-in implementation of memcpy tends to be optimized pretty heavily for the platform in question, so it will usually be faster than a naive for loop.

Some optimizations include copying as much as possible at a time (not single bytes but whole words, or even more if the processor in question supports it), some degree of loop unrolling, etc. Of course the best optimizations depend on the platform, so it's usually best to stick to the built-in function.

In most cases it was also written by people far more experienced than the typical user anyway.
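A rough illustration of the word-at-a-time idea (this is only a sketch, not Apple's or glibc's actual implementation; the names `copy_bytes` and `copy_words` are made up for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive byte-by-byte copy: one load and one store per byte. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Word-at-a-time copy: one load/store per machine word, with a
   byte loop for the tail. A real memcpy also aligns the head
   first; this sketch just falls back to bytes when misaligned. */
static void copy_words(uint8_t *dst, const uint8_t *src, size_t n) {
    if (((uintptr_t)dst | (uintptr_t)src) % sizeof(uintptr_t) != 0) {
        copy_bytes(dst, src, n);  /* misaligned: take the slow path */
        return;
    }
    size_t words = n / sizeof(uintptr_t);
    uintptr_t *d = (uintptr_t *)dst;
    const uintptr_t *s = (const uintptr_t *)src;
    for (size_t i = 0; i < words; i++)
        d[i] = s[i];                       /* 4 or 8 bytes per iteration */
    for (size_t i = words * sizeof(uintptr_t); i < n; i++)
        dst[i] = src[i];                   /* remaining tail bytes */
}
```

On a 32-bit ARM this already cuts the number of loads and stores roughly by four; real implementations go further with multi-register and NEON transfers.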

Matti Virkkunen
  • 63,558
  • 9
  • 127
  • 159
  • For a small amount of memory (say less than 16 words), you can write a short loop or your own ASM and get a small speedup due to the overhead of invoking memcpy and the alignment checks it needs to do. But for a whole page or more you will not be able to beat the built-in memcpy, since it contains ARM or NEON optimizations specific to the processor in question. If you copy megs of data, just use memcpy and don't worry about the details. – MoDJ Jul 04 '13 at 20:45
1

Sometimes mem-to-mem DMA is implemented in processors so, yes, if such a thing exists in the iPhone, then it's likely that memcpy( ) takes advantage of it. Even if it were not implemented, I'm not surprised by the 15-to-1 advantage that memcpy( ) seems to have over your character-by-character copy.

Moral 1: always prefer memcpy( ) to strcpy( ) if possible.
Moral 2: always prefer memmove( ) to memcpy( ); always.
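The memmove advantage matters when the source and destination regions overlap: memcpy's behavior is undefined there, while memmove is guaranteed to work. A small sketch (the helper name `shift_left` is made up for the example):

```c
#include <assert.h>
#include <string.h>

/* Shift a string left by one character, in place. The source and
   destination regions overlap, so memmove is required; with
   memcpy this would be undefined behavior (it might appear to
   work on one platform and silently corrupt data on another). */
static void shift_left(char *s) {
    size_t n = strlen(s);
    if (n > 0)
        memmove(s, s + 1, n);  /* the n bytes include the '\0' */
}
```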

Pete Wilson
  • 8,610
  • 6
  • 39
  • 51
  • I will do some measurements later and will post the results, because as I said I remember over a 10x speed-up – grunge fightr Apr 24 '11 at 19:04
  • Yes, as I said, I did some tests and here they are: `for(int i=0; i<1000000; i++) data_a[i]=data_b[i];` takes 60 milliseconds on my iPhone 3GS; `memcpy(data_b, data_a, 1000000);` takes about 3-6 milliseconds; so it is surprisingly much for me (on a PC under Windows I never noticed such a difference) – grunge fightr Apr 25 '11 at 09:01
1

The newest iPhone has SIMD (NEON) instructions on the ARM chip, allowing four operations at the same time. This includes moving memory around.

Also, if you create a highly optimized memcpy, you'd typically unroll the loops to a certain degree and implement it as a Duff's device.
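For reference, the classic Duff's device (due to Tom Duff; shown here as a byte-copy sketch) interleaves an 8-way unrolled loop with a switch that jumps into the middle of the loop body to handle counts that are not a multiple of 8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Duff's device: the switch dispatches into the middle of the
   unrolled do/while, so the leftover (count % 8) bytes are
   copied on the first partial pass and full groups of 8 after. */
static void duff_copy(char *dst, const char *src, size_t count) {
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;  /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

Note that on modern compilers a plain loop usually auto-vectorizes as well or better, so Duff's device is mostly of historical interest today.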

Toad
  • 15,593
  • 16
  • 82
  • 128
0

It looks like the ARM CPU has instructions that can copy 48 bits per access. I'd bet the lower overhead of doing it in larger chunks is what you're seeing.

Jay
  • 13,803
  • 4
  • 42
  • 69
  • It was more than 10x, though my loop was slow – it was a double for loop with things like `cameraBitsBuffer[y][x][0] = baseAddress[yt+(x<<2)+0];` (four times: for R, G, B, A) and it was terribly slow while memcpy was fast – grunge fightr Apr 24 '11 at 19:11
  • If you have array lookups like that, you are in essence performing a multiplication for every index calculation. This is much slower than just moving data around. – Toad Nov 19 '11 at 22:02
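To illustrate the point in this comment thread: when the source and destination use the same packed RGBA layout, the whole image is one contiguous block, and the per-channel indexed copy can collapse into a single memcpy. A sketch (the names and 4x3 dimensions are made up for the example; the original code's other variables such as `yt` are replaced with an explicit row pointer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define W 4
#define H 3

/* Per-channel copy in the style of the comment above: four
   indexed stores per pixel, each with its own address math. */
static void copy_per_channel(uint8_t dst[H][W][4], const uint8_t *src) {
    for (int y = 0; y < H; y++) {
        const uint8_t *row = src + y * W * 4;  /* start of row y */
        for (int x = 0; x < W; x++) {
            dst[y][x][0] = row[(x << 2) + 0];  /* R */
            dst[y][x][1] = row[(x << 2) + 1];  /* G */
            dst[y][x][2] = row[(x << 2) + 2];  /* B */
            dst[y][x][3] = row[(x << 2) + 3];  /* A */
        }
    }
}

/* Same layout on both sides: the image is one contiguous block,
   so a single memcpy does the same job with none of the
   per-pixel index arithmetic. */
static void copy_flat(uint8_t dst[H][W][4], const uint8_t *src) {
    memcpy(dst, src, (size_t)H * W * 4);
}
```

Both produce identical output; the difference is purely in how much address arithmetic runs per byte copied.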