I am using memcpy() in my program. As I increase the number of variables, the CPU usage increases; it is as if memcpy were implemented as a for loop. Is there a fast memcpy function in Linux too? Should I apply a patch and recompile the kernel?
-
Huh? Windows is using DMA for memcpy? :) – Joachim Isaksson Jun 18 '13 at 11:37
-
As far as I know, it is DMA. – Sam Jun 18 '13 at 11:42
-
DMA assumes communication with another device ;> – matekm Jun 18 '13 at 11:56
-
You may want to take a look [here](http://stackoverflow.com/questions/4260602/how-to-increase-performance-of-memcpy) for some ideas. – Joachim Isaksson Jun 18 '13 at 12:05
-
I think it is fair to assume a linear complexity for `memcpy()`, isn't it? – rectummelancolique Jun 18 '13 at 12:05
-
I am writing an OpenGL display program and use memcpy to update every angle row of the display, which takes CPU time. Can I optimize it? See this link: http://www.gossamer-threads.com/lists/linux/kernel/1458461 – Sam Jun 18 '13 at 12:06
-
As the link says, if I compile the fast memcpy into the kernel, why would my program not use the optimized one in the kernel? Besides, what algorithm could I use instead in the current situation? – Sam Jun 18 '13 at 12:16
1 Answer
There are architectures where the bus between the CPU and memory is rather weak; some of those architectures add a DMA engine to allow big blocks of memory to be copied without having a loop running on the CPU.
In Linux, you would be able to access the DMA engine with the dmaengine subsystem, but it is very hardware-dependent whether such an engine is actually available.
x86 CPUs have a good memory subsystem and also have special hardware support for copying large blocks, so using a DMA engine would be very unlikely to actually help. (Intel added a DMA engine called I/OAT to some server boards, but the overall results were not much better than plain CPU copies.)
DMA forces the data out of the CPU caches, so doing DMA copies for your program's variables would be utterly pointless because the first CPU access afterwards would have to read them back into the cache.
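If you want to see how the CPU cost of plain `memcpy()` grows with the amount of data on your own machine, a rough measurement is easy to write. The following is only a minimal sketch, not a rigorous benchmark: `seconds_per_copy` and `print_copy_table` are hypothetical helper names, the buffer sizes and repeat count are arbitrary assumptions, and the empty `__asm__` statement is a GCC/Clang extension used here as a compiler barrier. It uses `clock()`, which reports processor time, so it measures the CPU usage the question is about.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not from the original post): average CPU seconds
 * spent per memcpy() of `nbytes`, measured over `repeats` copies.
 * Returns -1.0 on invalid arguments or if allocation fails. */
double seconds_per_copy(size_t nbytes, int repeats)
{
    if (nbytes == 0 || repeats <= 0)
        return -1.0;

    unsigned char *src = malloc(nbytes);
    unsigned char *dst = malloc(nbytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xAB, nbytes);
    memset(dst, 0, nbytes);

    clock_t start = clock();            /* processor time, not wall time */
    for (int i = 0; i < repeats; i++) {
        memcpy(dst, src, nbytes);
        /* Compiler barrier (GCC/Clang extension): keeps the optimizer
         * from dropping or merging the repeated copies. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_t stop = clock();

    /* Sanity check that the data really was copied. */
    assert(memcmp(dst, src, nbytes) == 0);

    free(src);
    free(dst);
    return (double)(stop - start) / CLOCKS_PER_SEC / repeats;
}

/* Prints the per-copy cost for a few arbitrarily chosen buffer sizes. */
void print_copy_table(void)
{
    for (size_t kib = 64; kib <= 16384; kib *= 4) {
        double s = seconds_per_copy(kib * 1024, 200);
        printf("%6zu KiB: %10.2f us per copy\n", kib, s * 1e6);
    }
}
```

Calling `print_copy_table()` from `main` prints one line per buffer size. The time per copy should grow with the number of bytes, with visible steps once the buffers no longer fit in a cache level; the exact numbers depend entirely on your CPU and memory. If the copies dominate your CPU usage, copying fewer bytes is likely to help more than switching the copy mechanism.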
