I am looking for a way to copy a 2D array into another (larger) 2D array extremely fast, using SSE/MMX/3DNow!/SIMD (whatever works). I do not want to implement it myself; I am just looking for a highly optimized, supported, and maintained solution. I am using Clang(++) on Linux.
Something with a signature like this:

`void memcpy2Di(const int *src, int *dest, int srcw, int srch, int destw, int desth, int destx, int desty);`
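Before reaching for hand-written intrinsics, note that with row-major storage each row is contiguous in both arrays. The 2D copy therefore reduces to one `memcpy` per row, and on x86-64 glibc already selects SSE2/AVX variants of `memcpy` at runtime. Below is a minimal sketch of the requested function (called `memcpy2Di` here) built on that idea. It assumes row-major `int` arrays and non-negative `destx`/`desty`. Clipping the block so it never writes past `dest` is my own assumption and is not part of the question. This is not a library API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from any library): copies a srcw x srch block of
// row-major ints from src into the row-major destw x desth array dest, with
// the block's top-left corner at (destx, desty). Assumes destx, desty >= 0.
// The block is clipped so nothing is written outside dest.
void memcpy2Di(const int *src, int *dest, int srcw, int srch,
               int destw, int desth, int destx, int desty) {
    assert(src != nullptr && dest != nullptr);
    if (destx < 0 || desty < 0 || destx >= destw || desty >= desth) return;

    // Clip the copied region to the part that fits inside dest.
    const int w = std::min(srcw, destw - destx);
    const int h = std::min(srch, desth - desty);
    if (w <= 0 || h <= 0) return;

    // Rows are contiguous, so each one is a single memcpy; the library
    // memcpy handles the SIMD work. size_t casts avoid int overflow in
    // the offset arithmetic for large arrays.
    for (int y = 0; y < h; ++y) {
        int *drow = dest + static_cast<std::size_t>(desty + y) * destw + destx;
        const int *srow = src + static_cast<std::size_t>(y) * srcw;
        std::memcpy(drow, srow, static_cast<std::size_t>(w) * sizeof(int));
    }
}
```

Copying a 2x2 block into a 4x3 array at (1, 1) fills indices 5, 6, 9 and 10 of `dest` and leaves the rest untouched. If profiling shows the per-row `memcpy` calls are still the bottleneck (for example, with very narrow blocks), a dedicated image-processing library's ROI copy routine would be the next thing to try.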