4

A few days ago I was writing some code and noticed that copying RAM with memcpy was much, much faster than copying it in a for loop.

I have no measurements right now (maybe I'll do some later), but as I remember, the same block of RAM that a for loop copied in about 300 ms or more was copied by memcpy in 20 ms or less.

How is this possible? Is memcpy hardware accelerated?

Jakub Hampl
  • 39,863
  • 10
  • 77
  • 106
grunge fightr
  • 1,360
  • 2
  • 19
  • 38
  • Loop penalties removed? Maybe a code snippet would help to answer your question more precisely. –  Apr 24 '11 at 18:51

5 Answers

2

Well, I can't speak about Apple's compilers, but gcc definitely treats memcpy as a builtin.

C. K. Young
  • 219,335
  • 46
  • 382
  • 435
2

The built-in implementation of memcpy tends to be optimized pretty heavily for the platform in question, so it will usually be faster than a naive for loop.

Some optimizations include copying as much as possible at a time (not single bytes but whole words, or even more if the processor in question supports it), some degree of loop unrolling, etc. Of course the best optimizations depend on the platform, so it's usually best to stick to the built-in function.

In most cases it was also written by people far more experienced than the typical user anyway.
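A rough illustration of the word-at-a-time idea (this is only a sketch, not Apple's or glibc's actual implementation; the names `copy_bytes` and `copy_words` are made up for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Naive byte-by-byte copy: one load and one store per byte. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Word-at-a-time copy: one load/store per machine word, with a
   byte loop for the tail. A real memcpy also aligns the head
   first; this sketch just falls back to bytes when misaligned. */
static void copy_words(uint8_t *dst, const uint8_t *src, size_t n) {
    if (((uintptr_t)dst | (uintptr_t)src) % sizeof(uintptr_t) != 0) {
        copy_bytes(dst, src, n);  /* misaligned: take the slow path */
        return;
    }
    size_t words = n / sizeof(uintptr_t);
    uintptr_t *d = (uintptr_t *)dst;
    const uintptr_t *s = (const uintptr_t *)src;
    for (size_t i = 0; i < words; i++)
        d[i] = s[i];                       /* 4 or 8 bytes per iteration */
    for (size_t i = words * sizeof(uintptr_t); i < n; i++)
        dst[i] = src[i];                   /* remaining tail bytes */
}
```

On a 32-bit ARM this already cuts the number of loads and stores roughly by four; real implementations go further with multi-register and NEON transfers.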

Matti Virkkunen
  • 63,558
  • 9
  • 127
  • 159
  • For a small amount of memory (say less than 16 words), you can write a short loop or your own ASM and get a small speedup due to the overhead of invoking memcpy and the alignment checks it needs to do. But for a whole page or more you will not be able to beat the built-in memcpy, since it contains ARM or NEON optimizations specific to the processor in question. If you copy megs of data, just use memcpy and don't worry about the details. – MoDJ Jul 04 '13 at 20:45
1

Sometimes mem-to-mem DMA is implemented in processors so, yes, if such a thing exists in the iPhone, then it's likely that memcpy( ) takes advantage of it. Even if it were not implemented, I'm not surprised by the 15-to-1 advantage that memcpy( ) seems to have over your character-by-character copy.

Moral 1: always prefer memcpy( ) to strcpy( ) if possible.
Moral 2: always prefer memmove( ) to memcpy( ); always.
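The memmove advantage matters when the source and destination regions overlap: memcpy's behavior is undefined there, while memmove is guaranteed to work. A small sketch (the helper name `shift_left` is made up for the example):

```c
#include <assert.h>
#include <string.h>

/* Shift a string left by one character, in place. The source and
   destination regions overlap, so memmove is required; with
   memcpy this would be undefined behavior (it might appear to
   work on one platform and silently corrupt data on another). */
static void shift_left(char *s) {
    size_t n = strlen(s);
    if (n > 0)
        memmove(s, s + 1, n);  /* the n bytes include the '\0' */
}
```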

Pete Wilson
  • 8,610
  • 6
  • 39
  • 51
  • I will do some measurements later and will post the results, because as I said I remember over a 10x speed-up – grunge fightr Apr 24 '11 at 19:04
  • Yes, as I said, I did some tests and here they are: `for(int i=0; i<1000000; i++) data_a[i]=data_b[i];` takes 60 milliseconds on my iPhone 3GS; `memcpy(data_b, data_a, 1000000);` takes about 3-6 milliseconds; so it is surprisingly much for me (on a PC under Windows I never noticed such a difference) – grunge fightr Apr 25 '11 at 09:01
1

The newest iPhone has SIMD (NEON) instructions on the ARM chip, allowing four operations at the same time. This includes moving memory around.

Also, if you create a highly optimized memcpy, you'd typically unroll the loops to a certain degree and implement it as a Duff's device.
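For reference, the classic Duff's device (due to Tom Duff; shown here as a byte-copy sketch) interleaves an 8-way unrolled loop with a switch that jumps into the middle of the loop body to handle counts that are not a multiple of 8:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Duff's device: the switch dispatches into the middle of the
   unrolled do/while, so the leftover (count % 8) bytes are
   copied on the first partial pass and full groups of 8 after. */
static void duff_copy(char *dst, const char *src, size_t count) {
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;  /* number of passes through the loop */
    switch (count % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

Note that on modern compilers a plain loop usually auto-vectorizes as well or better, so Duff's device is mostly of historical interest today.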

Toad
  • 15,593
  • 16
  • 82
  • 128
0

It looks like the ARM CPU has instructions that can copy 48 bits per access. I'd bet the lower overhead of doing it in larger chunks is what you're seeing.

Jay
  • 13,803
  • 4
  • 42
  • 69
  • It was more than 10x, though my loop was slow – it was a double for loop with things like `cameraBitsBuffer[y][x][0] = baseAddress[yt+(x<<2)+0];` (four times: for R, G, B, A) and it was terribly slow while memcpy was fast – grunge fightr Apr 24 '11 at 19:11
  • If you have array lookups like that, you are in essence performing a multiplication for every index calculation. This is much slower than just moving data around. – Toad Nov 19 '11 at 22:02
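To illustrate the point in this comment thread: when the source and destination use the same packed RGBA layout, the whole image is one contiguous block, and the per-channel indexed copy can collapse into a single memcpy. A sketch (the names and 4x3 dimensions are made up for the example; the original code's other variables such as `yt` are replaced with an explicit row pointer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define W 4
#define H 3

/* Per-channel copy in the style of the comment above: four
   indexed stores per pixel, each with its own address math. */
static void copy_per_channel(uint8_t dst[H][W][4], const uint8_t *src) {
    for (int y = 0; y < H; y++) {
        const uint8_t *row = src + y * W * 4;  /* start of row y */
        for (int x = 0; x < W; x++) {
            dst[y][x][0] = row[(x << 2) + 0];  /* R */
            dst[y][x][1] = row[(x << 2) + 1];  /* G */
            dst[y][x][2] = row[(x << 2) + 2];  /* B */
            dst[y][x][3] = row[(x << 2) + 3];  /* A */
        }
    }
}

/* Same layout on both sides: the image is one contiguous block,
   so a single memcpy does the same job with none of the
   per-pixel index arithmetic. */
static void copy_flat(uint8_t dst[H][W][4], const uint8_t *src) {
    memcpy(dst, src, (size_t)H * W * 4);
}
```

Both produce identical output; the difference is purely in how much address arithmetic runs per byte copied.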