I developed a naive function for mirroring an image horizontally or vertically using CUDA C++.
Then I learned that the NVIDIA Performance Primitives (NPP) library also offers a function for image mirroring.
Just for the sake of comparison, I timed my function against NPP. Surprisingly, my function outperformed it, although only by a small margin.
I confirmed the results several times using a Windows timer as well as the CUDA timer.
My question is: aren't NPP functions fully optimized for NVIDIA GPUs?
I'm using CUDA 5.0, a GeForce GTX 460M (compute capability 2.1), and Windows 8 for development.
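
For concreteness, a comparison of this kind could be set up roughly as follows. This is a minimal sketch, not the exact code that was timed: the kernel name `mirrorKernel`, the 8-bit single-channel format, the 1920x1080 image size, the 32x8 block shape, and the run count are all assumptions chosen for illustration. It uses a one-thread-per-pixel kernel that flips the image left to right, and times it against `nppiMirror_8u_C1R` with CUDA events after a warm-up call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <npp.h>

// Hypothetical naive kernel (assumed, not the original code): one thread per
// pixel, 8-bit single-channel image, flipped left to right.
__global__ void mirrorKernel(const unsigned char* src, size_t srcPitch,
                             unsigned char* dst, size_t dstPitch,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        dst[y * dstPitch + (width - 1 - x)] = src[y * srcPitch + x];
}

int main()
{
    // Assumed test parameters.
    const int width = 1920, height = 1080, runs = 100;

    unsigned char *src = 0, *dst = 0;
    size_t srcPitch = 0, dstPitch = 0;
    cudaMallocPitch(&src, &srcPitch, width, height);
    cudaMallocPitch(&dst, &dstPitch, width, height);
    cudaMemset2D(src, srcPitch, 0, width, height);

    dim3 block(32, 8);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    NppiSize roi = { width, height };

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msKernel = 0.0f, msNpp = 0.0f;

    // Warm-up so one-time initialization cost is not included in the timings.
    mirrorKernel<<<grid, block>>>(src, srcPitch, dst, dstPitch, width, height);
    nppiMirror_8u_C1R(src, (int)srcPitch, dst, (int)dstPitch, roi,
                      NPP_VERTICAL_AXIS);
    cudaDeviceSynchronize();

    // Time the hand-written kernel over several runs.
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i)
        mirrorKernel<<<grid, block>>>(src, srcPitch, dst, dstPitch,
                                      width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msKernel, start, stop);

    // Time the NPP call over the same number of runs.
    // NPP_VERTICAL_AXIS mirrors about the vertical axis (left to right).
    cudaEventRecord(start);
    for (int i = 0; i < runs; ++i)
        nppiMirror_8u_C1R(src, (int)srcPitch, dst, (int)dstPitch, roi,
                          NPP_VERTICAL_AXIS);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msNpp, start, stop);

    printf("kernel: %.4f ms/run, NPP: %.4f ms/run\n",
           msKernel / runs, msNpp / runs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Averaging over many runs and excluding the first call matters here because a small per-call difference can easily be dominated by launch overhead or library initialization rather than by the mirroring work itself. Error checking is omitted for brevity, and the program needs to be linked against the NPP library.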