Is there something like extremely optimized memcpy2d in C/C++?

Question

I am looking for something to copy a 2D array into another (larger) 2D array extremely fast, using SSE/MMX/3DNow!/SIMD (whatever). I do not want to implement it myself; I am just looking for a highly optimized, supported, and maintained solution. I am using Clang(++) on Linux.

memcpy2Di(int *src, int *dest, int srcw, int srch, int destw, int desth, int destx, int desty)

You should be able to do `memcpy` in a loop and get acceptable performance. – Mark Ransom, Feb 27 '14 at 02:45
I think the right path would be to use the blitting functions available in the Linux Direct Rendering Infrastructure (DRI). They will use some fast implementation, possibly offloading to the graphics card or to some DMA engine. If DRI is undesirable because the NVIDIA proprietary drivers lack support for it, using cairo surfaces or OpenGL would be a hardware-independent (and even cross-platform) solution. I have no experience with those, so I can't give a direct answer, but I hope this helps. – hdante, Feb 27 '14 at 03:26
@hdante, sending the data to your video board is SLOW, and sending it back to your main memory is even slower. That would not buy you anything in this situation. – Alexis Wilke, Feb 27 '14 at 04:22
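
The row-by-row `memcpy` approach suggested in the first comment can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the name `memcpy2d_int` is hypothetical (it is not part of any library), the parameters mirror the signature in the question, and both arrays are assumed to be contiguous, row-major, and non-overlapping.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from any library: copies a srcw x srch block of
 * ints into a destw x desth array, with its top-left corner placed at
 * column destx, row desty. Both arrays are assumed to be row-major,
 * contiguous, and non-overlapping.
 * Returns 0 on success, or -1 if the block does not fit in the destination. */
int memcpy2d_int(const int *src, int *dest,
                 size_t srcw, size_t srch,
                 size_t destw, size_t desth,
                 size_t destx, size_t desty)
{
    /* Bounds check written so the unsigned arithmetic cannot wrap around. */
    if (destx > destw || srcw > destw - destx ||
        desty > desth || srch > desth - desty)
        return -1;

    /* One memcpy per row: the source advances by its own width, the
     * destination by the (larger) destination width. */
    for (size_t row = 0; row < srch; ++row) {
        memcpy(dest + (desty + row) * destw + destx,
               src + row * srcw,
               srcw * sizeof *src);
    }
    return 0;
}
```

For example, `memcpy2d_int(src, dest, 3, 2, 5, 4, 1, 1)` would place a 3x2 block at column 1, row 1 of a 5x4 destination. Since each row is a single `memcpy` call, whatever vectorized copy the C library (or a replacement for it) provides does the inner work.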

score 5 · Answer 1 · answered Feb 27 '14 at 02:53

Take a look at Asmlib by Agner Fog. It provides extremely optimized versions of memcpy and other common libc functions, written in assembly and using the best SIMD instruction set available on your CPU, from basic SSE all the way up to the AVX2 and FMA3 instructions found in Haswell processors, for instance.

asamarin

He also made a table comparing e.g. memcpy across different compilers and OSes. GCC's memcpy from glibc was slow. The intrinsic memcpy was much worse. I don't know if it has improved since. http://stackoverflow.com/questions/855895/intrinsic-memcmp/6334452#6334452 – Z boson Feb 28 '14 at 08:24
@Zboson do you mean the table on page 4 [here](http://agner.org/optimize/asmlib-instructions.pdf)? It clearly shows that asmlib beats 'em all; the compilers and libraries used in the comparison are indeed a bit outdated, though. – asamarin Feb 28 '14 at 16:11
@asamarin, yeah, I was agreeing with you. By intrinsic memcpy I mean the memcpy built into GCC itself; it's even worse than the one from glibc. – Z boson Feb 28 '14 at 16:29
@Zboson Oh, sorry! I thought you meant [these](http://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/) intrinsics (i.e. C/C++ wrappers around the assembly instructions and datatypes). Confusing name, haha :D – asamarin Feb 28 '14 at 17:08

score 1 · Accepted Answer · answered Feb 27 '14 at 04:26

There is the Intel IPP library. It is mostly used for things such as math computations on large matrices, but I am pretty sure it has copy functions too. The library initializes itself to use the fastest version of each function for your processor, and Intel keeps it up to date, so when new processors come out, the functions are eventually reimplemented with the new instructions to make things even faster.
